Part 1: EDA, Cleaning, and Preprocessing¶
In this section, we will:
- Load and inspect the Spotify dataset.
- Perform exploratory data analysis (EDA) to understand the data structure.
- Clean and preprocess the data by handling missing values and removing irrelevant columns.
InĀ [2]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Set visual styles
sns.set(style='whitegrid')
# Optional: Widen display for pandas
pd.set_option('display.max_columns', None)
InĀ [3]:
df = pd.read_csv('spotify_data.csv')
InĀ [4]:
# Preview the first few rows of the dataset
df.head()
Out[4]:
| Unnamed: 0 | artist_name | track_name | track_id | popularity | year | genre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Jason Mraz | I Won't Give Up | 53QF56cjZA9RTuuMZDrSA6 | 68 | 2012 | acoustic | 0.483 | 0.303 | 4 | -10.058 | 1 | 0.0429 | 0.6940 | 0.000000 | 0.1150 | 0.139 | 133.406 | 240166 | 3 |
| 1 | 1 | Jason Mraz | 93 Million Miles | 1s8tP3jP4GZcyHDsjvw218 | 50 | 2012 | acoustic | 0.572 | 0.454 | 3 | -10.286 | 1 | 0.0258 | 0.4770 | 0.000014 | 0.0974 | 0.515 | 140.182 | 216387 | 4 |
| 2 | 2 | Joshua Hyslop | Do Not Let Me Go | 7BRCa8MPiyuvr2VU3O9W0F | 57 | 2012 | acoustic | 0.409 | 0.234 | 3 | -13.711 | 1 | 0.0323 | 0.3380 | 0.000050 | 0.0895 | 0.145 | 139.832 | 158960 | 4 |
| 3 | 3 | Boyce Avenue | Fast Car | 63wsZUhUZLlh1OsyrZq7sz | 58 | 2012 | acoustic | 0.392 | 0.251 | 10 | -9.845 | 1 | 0.0363 | 0.8070 | 0.000000 | 0.0797 | 0.508 | 204.961 | 304293 | 4 |
| 4 | 4 | Andrew Belle | Sky's Still Blue | 6nXIYClvJAfi6ujLiKqEq8 | 54 | 2012 | acoustic | 0.430 | 0.791 | 6 | -5.419 | 0 | 0.0302 | 0.0726 | 0.019300 | 0.1100 | 0.217 | 171.864 | 244320 | 4 |
InĀ [5]:
# Get basic info about columns, data types, and non-null counts
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1159764 entries, 0 to 1159763 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 1159764 non-null int64 1 artist_name 1159749 non-null object 2 track_name 1159763 non-null object 3 track_id 1159764 non-null object 4 popularity 1159764 non-null int64 5 year 1159764 non-null int64 6 genre 1159764 non-null object 7 danceability 1159764 non-null float64 8 energy 1159764 non-null float64 9 key 1159764 non-null int64 10 loudness 1159764 non-null float64 11 mode 1159764 non-null int64 12 speechiness 1159764 non-null float64 13 acousticness 1159764 non-null float64 14 instrumentalness 1159764 non-null float64 15 liveness 1159764 non-null float64 16 valence 1159764 non-null float64 17 tempo 1159764 non-null float64 18 duration_ms 1159764 non-null int64 19 time_signature 1159764 non-null int64 dtypes: float64(9), int64(7), object(4) memory usage: 177.0+ MB
InĀ [6]:
# Check the shape of the dataset (rows, columns)
print(f"Dataset Shape: {df.shape}")
Dataset Shape: (1159764, 20)
InĀ [7]:
df.drop('Unnamed: 0',axis = 1,inplace = True)
InĀ [8]:
# Basic statistics
print("\nBasic statistics:")
print(df.describe())
Basic statistics:
popularity year danceability energy key \
count 1.159764e+06 1.159764e+06 1.159764e+06 1.159764e+06 1.159764e+06
mean 1.838312e+01 2.011955e+03 5.374382e-01 6.396699e-01 5.287778e+00
std 1.588554e+01 6.803901e+00 1.844780e-01 2.705009e-01 3.555197e+00
min 0.000000e+00 2.000000e+03 0.000000e+00 0.000000e+00 0.000000e+00
25% 5.000000e+00 2.006000e+03 4.130000e-01 4.540000e-01 2.000000e+00
50% 1.500000e+01 2.012000e+03 5.500000e-01 6.940000e-01 5.000000e+00
75% 2.900000e+01 2.018000e+03 6.770000e-01 8.730000e-01 8.000000e+00
max 1.000000e+02 2.023000e+03 9.930000e-01 1.000000e+00 1.100000e+01
loudness mode speechiness acousticness \
count 1.159764e+06 1.159764e+06 1.159764e+06 1.159764e+06
mean -8.981353e+00 6.346533e-01 9.281477e-02 3.215370e-01
std 5.682215e+00 4.815275e-01 1.268409e-01 3.549872e-01
min -5.810000e+01 0.000000e+00 0.000000e+00 0.000000e+00
25% -1.082900e+01 0.000000e+00 3.710000e-02 6.400000e-03
50% -7.450000e+00 1.000000e+00 5.070000e-02 1.470000e-01
75% -5.276000e+00 1.000000e+00 8.900000e-02 6.400000e-01
max 6.172000e+00 1.000000e+00 9.710000e-01 9.960000e-01
instrumentalness liveness valence tempo \
count 1.159764e+06 1.159764e+06 1.159764e+06 1.159764e+06
mean 2.523489e-01 2.230189e-01 4.555636e-01 1.213771e+02
std 3.650731e-01 2.010707e-01 2.685190e-01 2.977975e+01
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 1.050000e-06 9.790000e-02 2.260000e-01 9.879700e+01
50% 1.760000e-03 1.340000e-01 4.380000e-01 1.219310e+02
75% 6.140000e-01 2.920000e-01 6.740000e-01 1.399030e+02
max 1.000000e+00 1.000000e+00 1.000000e+00 2.499930e+02
duration_ms time_signature
count 1.159764e+06 1.159764e+06
mean 2.495618e+05 3.885879e+00
std 1.494262e+05 4.676967e-01
min 2.073000e+03 0.000000e+00
25% 1.810910e+05 4.000000e+00
50% 2.257440e+05 4.000000e+00
75% 2.869135e+05 4.000000e+00
max 6.000495e+06 5.000000e+00
InĀ [9]:
# Check for missing values in each column
df.isnull().sum()
Out[9]:
artist_name 15 track_name 1 track_id 0 popularity 0 year 0 genre 0 danceability 0 energy 0 key 0 loudness 0 mode 0 speechiness 0 acousticness 0 instrumentalness 0 liveness 0 valence 0 tempo 0 duration_ms 0 time_signature 0 dtype: int64
InĀ [10]:
df.dropna(inplace = True)
df.describe()
df.shape
Out[10]:
(1159748, 19)
Feature Engineering: Artist Popularity¶
In this step, we calculate the average popularity for each artist, categorize them into bins, and visualize the distribution.
InĀ [11]:
# Add a new column 'artist_popularity' with the median popularity of each artist's songs
df['artist_popularity'] = df.groupby('artist_name')['popularity'].transform('median')
# Preview the new feature
df[['artist_name', 'artist_popularity']].head()
Out[11]:
| artist_name | artist_popularity | |
|---|---|---|
| 0 | Jason Mraz | 23.0 |
| 1 | Jason Mraz | 23.0 |
| 2 | Joshua Hyslop | 21.0 |
| 3 | Boyce Avenue | 37.0 |
| 4 | Andrew Belle | 29.0 |
InĀ [12]:
# Plot the distribution of artist popularity scores
plt.figure(figsize=(10, 5))
sns.kdeplot(df['artist_popularity'], color='green')
plt.xlabel("Artist Popularity (Median Popularity of Songs)")
plt.ylabel("Density")
plt.title("Artist Popularity Distribution (KDE Plot)")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
InĀ [13]:
df['artist_popularity'].describe()
Out[13]:
count 1.159748e+06 mean 1.690622e+01 std 1.382647e+01 min 0.000000e+00 25% 6.000000e+00 50% 1.500000e+01 75% 2.600000e+01 max 8.500000e+01 Name: artist_popularity, dtype: float64
InĀ [14]:
# Define real-world artist popularity bins
bins = [0, 25,50,75, 100]
labels = ['Underground', 'Emerging', 'Mainstream', 'Superstars']
# Assign bins based on real-world artist popularity categorization
df['artist_popularity_bin'] = pd.cut(df['artist_popularity'], bins=bins, labels=labels, include_lowest=True)
df['artist_popularity_bin'].value_counts()
Out[14]:
artist_popularity_bin Underground 864791 Emerging 270974 Mainstream 23884 Superstars 99 Name: count, dtype: int64
Popularity Score Distribution & Zero-Popularity Filtering¶
InĀ [15]:
# Visualize the distribution of song popularity scores.
# This helps identify data skewness and the presence of zero-popularity songs.
plt.figure(figsize=(10, 5))
plt.hist(df['popularity'], bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Popularity')
plt.ylabel('Count')
plt.title('Distribution of Popularity Scores')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
# Count how many songs have a popularity score of zero.
zero_popularity_count = (df['popularity'] == 0).sum()
print(f"Number of songs with zero popularity: {zero_popularity_count}")
# Remove songs with zero popularity from the dataset.
# These entries are likely unlisted, unreleased, or obscure tracks with no listener engagement.
df = df[df['popularity'] > 0]
# Confirm the dataset shape after filtering.
print(f"Dataset shape after removing zero-popularity songs: {df.shape}")
Number of songs with zero popularity: 158391 Dataset shape after removing zero-popularity songs: (1001357, 21)
InĀ [16]:
current_year = df['year'].max()
df_current = df[df['year'] == current_year]
# Top 10 most popular songs
top_songs = df_current.nlargest(10, 'popularity')[['track_name', 'popularity']]
# Top 10 most popular artists
top_artists = df_current.groupby('artist_name')['popularity'].mean().nlargest(10)
colors = [(0, 0, 1, alpha) for alpha in reversed([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.0])]
# Plot for top songs
plt.figure(figsize=(10, 5))
plt.barh(top_songs['track_name'], top_songs['popularity'], color=colors)
plt.xlabel('Popularity')
plt.ylabel('Songs')
plt.title(f'Top 10 Most Popular Songs in {current_year}')
plt.gca().invert_yaxis()
plt.show()
# Plot for top artists
plt.figure(figsize=(10, 5))
plt.barh(top_artists.index, top_artists.values, color=colors)
plt.xlabel('Average Popularity')
plt.ylabel('Artists')
plt.title(f'Top 10 Most Popular Artists in {current_year}')
plt.gca().invert_yaxis()
plt.show()
Genre-Based EDA & Filtering¶
In this step, we:
- Explore the distribution of genres in the dataset.
- Identify the top 10 genres based on their average song popularity.
- Filter the dataset to focus on these top genres for better modeling insights.
InĀ [17]:
# List all unique genres
unique_genres = df['genre'].unique()
print(f"Number of unique genres: {len(unique_genres)}")
print("Unique genres:", unique_genres)
Number of unique genres: 82 Unique genres: ['acoustic' 'afrobeat' 'alt-rock' 'ambient' 'black-metal' 'blues' 'breakbeat' 'cantopop' 'chicago-house' 'chill' 'classical' 'club' 'comedy' 'country' 'dance' 'dancehall' 'death-metal' 'deep-house' 'detroit-techno' 'disco' 'drum-and-bass' 'dub' 'dubstep' 'edm' 'electro' 'electronic' 'emo' 'folk' 'forro' 'french' 'funk' 'garage' 'german' 'gospel' 'goth' 'grindcore' 'groove' 'guitar' 'hard-rock' 'hardcore' 'hardstyle' 'heavy-metal' 'hip-hop' 'house' 'indian' 'indie-pop' 'industrial' 'jazz' 'k-pop' 'metal' 'metalcore' 'minimal-techno' 'new-age' 'opera' 'party' 'piano' 'pop' 'pop-film' 'power-pop' 'progressive-house' 'psych-rock' 'punk' 'punk-rock' 'rock' 'rock-n-roll' 'romance' 'sad' 'salsa' 'samba' 'sertanejo' 'show-tunes' 'singer-songwriter' 'ska' 'sleep' 'songwriter' 'soul' 'spanish' 'swedish' 'tango' 'techno' 'trance' 'trip-hop']
InĀ [18]:
# Count how many songs are in each genre
plt.figure(figsize=(22, 22))
genre_counts = df['genre'].value_counts()
sns.barplot(y=genre_counts.index, x=genre_counts.values, palette="magma")
plt.xlabel("Number of Tracks")
plt.ylabel("Genre")
plt.title("Number of Tracks per Genre (All Genres)")
plt.show()
C:\Users\samue\AppData\Local\Temp\ipykernel_24684\2455993117.py:4: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect. sns.barplot(y=genre_counts.index, x=genre_counts.values, palette="magma")
InĀ [19]:
# Identify the top 10 genres by mean popularity
top_genres = df.groupby('genre')['popularity'].mean().sort_values(ascending=False).head(10)
print("Top 10 Genres by Average Popularity:\n", top_genres)
Top 10 Genres by Average Popularity: genre pop 56.063459 rock 47.517498 hip-hop 46.561844 dance 43.247913 metal 39.800914 alt-rock 38.826969 sad 36.560785 indie-pop 35.689192 folk 33.665069 country 33.134503 Name: popularity, dtype: float64
InĀ [20]:
plt.figure(figsize=(14, 20))
genre_popularity = df.groupby('genre')['popularity'].mean().sort_values()
sns.barplot(y=genre_popularity.index, x=genre_popularity.values, palette="coolwarm")
plt.xlabel("Average Popularity")
plt.ylabel("Genre")
plt.title("Average Popularity per Genre ")
plt.show()
C:\Users\samue\AppData\Local\Temp\ipykernel_24684\2623246707.py:3: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect. sns.barplot(y=genre_popularity.index, x=genre_popularity.values, palette="coolwarm")
Focusing on top 10 genres¶
InĀ [21]:
# Filter the dataset to include only the top 10 genres
df_final = df[df['genre'].isin(top_genres.index)]
# Confirm the shape of the filtered dataset
print(f"Shape of df_final (top genres only): {df_final.shape}")
# Check which genres are left in df_final
print("Genres in df_final:", df_final['genre'].unique())
df_final['genre'].value_counts()
Shape of df_final (top genres only): (119854, 21) Genres in df_final: ['alt-rock' 'country' 'dance' 'folk' 'hip-hop' 'indie-pop' 'metal' 'pop' 'rock' 'sad']
Out[21]:
genre alt-rock 20794 country 17836 dance 17127 folk 16066 hip-hop 15620 indie-pop 9974 metal 7002 pop 6193 sad 6013 rock 3229 Name: count, dtype: int64
InĀ [22]:
plt.figure(figsize=(12, 6))
sns.barplot(y=top_genres.index, x=top_genres.values, palette="coolwarm")
plt.xlabel("Average Popularity")
plt.ylabel("Genre")
plt.title("Top 10 Genres by Average Popularity")
plt.show()
C:\Users\samue\AppData\Local\Temp\ipykernel_24684\883393045.py:2: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect. sns.barplot(y=top_genres.index, x=top_genres.values, palette="coolwarm")
InĀ [23]:
df_top_genres = df_final[df_final['genre'].isin(top_genres.index)]
total_tracks = len(df)
top_genres_tracks = df[df['genre'].isin(top_genres.index)].shape[0]
contribution_percentage = (top_genres_tracks / total_tracks) * 100
print('Total tracks:', total_tracks)
print('Total tracks in top genres:', top_genres_tracks)
print('Contribution percentage:', contribution_percentage)
plt.figure(figsize=(12, 6))
sns.boxplot(x="genre", y="popularity", data=df_top_genres, palette="Set2")
plt.xticks(rotation=45, ha='right')
plt.xlabel("Genre")
plt.ylabel("Popularity")
plt.title("Popularity Distribution for Top 10 Genres")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
Total tracks: 1001357 Total tracks in top genres: 119854 Contribution percentage: 11.969157852793758
C:\Users\samue\AppData\Local\Temp\ipykernel_24684\1819939053.py:12: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.boxplot(x="genre", y="popularity", data=df_top_genres, palette="Set2")
InĀ [24]:
#most popular artists
most_popular_artists = df_final.groupby("artist_name")["popularity"].median().sort_values(ascending=False).head(20)
most_popular_artists
Out[24]:
artist_name Elley Duhé 85.0 Rema 82.0 Chani Nattan 80.0 Jogja Hip Hop Foundation 80.0 Justine Skye 80.0 Isabel LaRosa 80.0 SunKissed Lola 80.0 Fujii Kaze 80.0 Cian Ducrot 79.0 Oxlade 79.0 Soegi Bornean 78.0 Aditya A 78.0 Duki 77.5 Shubh 77.5 Lasso 77.5 Andra & The Backbone 77.0 Maria Becerra 77.0 TINI 77.0 Marc Seguà 77.0 Olivia Rodrigo 76.5 Name: popularity, dtype: float64
InĀ [25]:
#most popular tracks
most_popular_tracks = df_final.groupby("track_name")["popularity"].median().sort_values(ascending=False).head(20)
most_popular_tracks
Out[25]:
track_name Shakira: Bzrp Music Sessions, Vol. 53 96.0 Die For You - Remix 95.0 Kill Bill 94.0 I'm Good (Blue) 93.0 Calm Down (with Selena Gomez) 93.0 La Bachata 93.0 Unholy (feat. Kim Petras) 92.0 Quevedo: Bzrp Music Sessions, Vol. 52 92.0 AMG 91.0 Yandel 150 91.0 Until I Found You (with Em Beihold) - Em Beihold Version 91.0 Escapism. 90.0 PRC 90.0 Rich Flex 90.0 Romantic Homicide 90.0 Hey Mor 90.0 Gato de Noche 89.0 CHORRITO PA LAS ANIMAS 89.0 Tormenta (feat. Bad Bunny) 89.0 Que Vuelvas 89.0 Name: popularity, dtype: float64
InĀ [26]:
#find most popular artist_by_genre
most_popular_artist_by_genre = df_final.loc[df_final.groupby("genre")["popularity"].idxmax(), ["genre", "artist_name", "track_name", "popularity"]]
most_popular_artist_by_genre
Out[26]:
| genre | artist_name | track_name | popularity | |
|---|---|---|---|---|
| 162916 | alt-rock | The Neighbourhood | Daddy Issues | 84 |
| 590794 | country | Morgan Wallen | Last Night | 88 |
| 541577 | dance | David Guetta | I'm Good (Blue) | 93 |
| 549416 | folk | Lizzy McAlpine | ceilings | 89 |
| 605178 | hip-hop | Bizarrap | Shakira: Bzrp Music Sessions, Vol. 53 | 96 |
| 561970 | indie-pop | JVKE | golden hour | 89 |
| 608995 | metal | Linkin Park | Lost | 84 |
| 612503 | pop | Miley Cyrus | Flowers | 100 |
| 297308 | rock | Imagine Dragons | Believer | 86 |
| 574169 | sad | Natanael Cano | AMG | 91 |
InĀ [27]:
# Top 5 genres based on popularity
fig = px.bar(df_final.nlargest(10, 'popularity'), x='genre', y=['popularity'], barmode='group')
fig.update_layout(yaxis_title="popularity", xaxis_title="genre", title = f'Top 5 genres based on popularity')
fig.update_layout(title={'x': 0.5, 'xanchor': 'center'})
fig.show()
InĀ [28]:
# Select 500,000 random rows from the rest of the dataset (genres not in top 10)
df_other_genres = df[~df['genre'].isin(top_genres)].sample(n=500000, random_state=42)
# Add these rows to df_final
df_final = pd.concat([df_final, df_other_genres], ignore_index=True)
# We are adding an additional 500,000 rows that don't include the top 10 genres as it will provide the model more context to predict popularity of music if a song they release doesn't fall into one of the top 10 genres
# save the final dataset
final_dataset_path = "final_spotify_dataset.csv"
df_final.to_csv(final_dataset_path, index=False)
final_dataset_path
df_final.shape
Out[28]:
(619854, 21)
Top 100 most popular songs EDA¶
InĀ [29]:
top_100_songs = df_final.nlargest(100, 'popularity')[['track_name', 'artist_name', 'genre', 'popularity']]
top_100_songs
# Count the number of songs in each genre from the top 100
top_100_genre_distribution = top_100_songs['genre'].value_counts()
top_100_genre_distribution
Out[29]:
genre pop 45 hip-hop 22 dance 13 sad 8 indie-pop 3 country 3 folk 2 garage 1 k-pop 1 sertanejo 1 rock 1 Name: count, dtype: int64
InĀ [30]:
# Distribution of track popularity
plt.figure(figsize=(10, 5))
sns.histplot(df_final['popularity'], bins=30, kde=True, color='blue', alpha=0.7)
plt.xlabel('Popularity')
plt.ylabel('Count')
plt.title('Distribution of Track Popularity')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
More EDA¶
InĀ [31]:
numerical_columns = df_final[['popularity', 'year', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']]
corr_matrix = numerical_columns.corr()
corr_matrix['popularity'].sort_values(ascending=False)
Out[31]:
popularity 1.000000 year 0.313431 danceability 0.164561 loudness 0.141909 time_signature 0.033556 valence 0.019359 energy 0.015626 tempo 0.007494 key 0.001824 mode -0.021348 speechiness -0.023218 acousticness -0.085553 liveness -0.088008 duration_ms -0.132001 instrumentalness -0.211257 Name: popularity, dtype: float64
InĀ [32]:
cmap = sns.diverging_palette(230, 20, as_cmap=True)
plt.figure(figsize=(15, 10))
sns.heatmap(numerical_columns.corr(), annot=True, fmt='.1g', vmin=-1, vmax=1, center=0, cmap=cmap)
plt.title("Correlation Matrix", fontweight='bold', fontsize='large')
Out[32]:
Text(0.5, 1.0, 'Correlation Matrix')
InĀ [33]:
# Create a grid of 4 rows and 3 columns of subplots (to fit 12 plots)
fig, ax = plt.subplots(4, 3, figsize=(15, 12))
# Define the columns to plot
columns = ["popularity", "danceability", "energy", "key", "loudness", "speechiness", "acousticness",
"instrumentalness", "liveness", "valence", "tempo", "duration_ms"]
# Define distinct colors for each plot
colors = ['blue', 'red', 'green', 'purple', 'orange', 'brown', 'pink', 'gray', 'cyan', 'yellow', 'teal', 'magenta']
# Iterate over each column and plot its histogram in the corresponding subplot
for i, col in enumerate(columns):
row, col_idx = divmod(i, 3) # Determine subplot grid position
ax[row, col_idx].hist(df_final[col], bins=30, color=colors[i], alpha=0.7)
ax[row, col_idx].set_title(col)
# Adjust layout to prevent overlap
plt.tight_layout()
# Display the plots
plt.show()
InĀ [34]:
# Total number of unique artists
total_artists = df_final['artist_name'].nunique()
print(f'Total number of unique artists: {total_artists}')
Total number of unique artists: 53531
InĀ [35]:
# Artist with most number of tracks
# Count tracks per artist
artist_counts = df_final['artist_name'].value_counts().reset_index()
# Rename columns
artist_counts.columns = ['Artist', 'Number of Tracks']
artist_counts.head(10)
Out[35]:
| Artist | Number of Tracks | |
|---|---|---|
| 0 | Grateful Dead | 1469 |
| 1 | Johann Sebastian Bach | 1035 |
| 2 | Traditional | 849 |
| 3 | Elvis Presley | 644 |
| 4 | $uicideboy$ | 482 |
| 5 | Hans Zimmer | 433 |
| 6 | Vybz Kartel | 431 |
| 7 | Ludwig van Beethoven | 424 |
| 8 | Wolfgang Amadeus Mozart | 414 |
| 9 | Armin van Buuren | 396 |
InĀ [36]:
# Average Song Duration Over the Years
df_final['duration_mins'] = df_final['duration_ms'] / 60000
def visualize_duration_vs_year():
year_duration = df_final.groupby('year')['duration_mins'].mean()
plt.figure(figsize=(10, 6))
sns.lineplot(x=year_duration.index, y=year_duration.values, color='blue')
plt.title('Average Song Duration Over the Years')
plt.xlabel('Year')
plt.ylabel('Average Duration (minutes)')
plt.grid(True)
plt.tight_layout()
plt.show()
visualize_duration_vs_year()
df_final.drop(columns=['duration_mins'], inplace=True)
InĀ [37]:
# Top 5 artists based on popularity and their associated features
fig = px.bar(df_final.nlargest(5, 'popularity'), x='artist_name', y=['valence', 'energy', 'danceability', 'acousticness'], barmode='group')
fig.update_layout(yaxis_title="audio feature", xaxis_title="artist", title = f'Top 5 artists based on popularity and their audio features')
fig.update_layout(title={'x': 0.5, 'xanchor': 'center'})
fig.show()
Part 2: Finding Our Best Overall Model¶
InĀ [38]:
# Import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, StackingRegressor, GradientBoostingRegressor, BaggingRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
import shap
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
import joblib
c:\Users\samue\Anaconda3\envs\ml1\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
InĀ [39]:
# drop unneeded variables before running our model (artist_unpopularity_bin actually just creates noise as we already have artist_popularity)
df_final.drop(columns=['artist_popularity_bin'], inplace=True)
# also drop the key variable as what key the song is in makes no difference in how popular it should be and will just create noise
df_final.drop(columns=['key'], inplace=True)
# Define categorical and numerical features
categorical_columns = ['genre', 'time_signature'] #time signature is a categorical variable even if it appears as numerical
numerical_columns = ['year', 'danceability', 'energy', 'loudness',
'mode', 'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo', 'duration_ms', 'artist_popularity']
# One-Hot Encode categorical features
df_final = pd.get_dummies(df_final, columns=categorical_columns, drop_first=True)
# Scale only the necessary numerical features
features_to_scale = ['year', 'tempo', 'duration_ms','artist_popularity']
scaler = StandardScaler()
df_final[features_to_scale] = scaler.fit_transform(df_final[features_to_scale])
# Define target and features FIRST before converting to float
X = df_final.drop(columns=['popularity', 'track_id', 'track_name', 'artist_name'], axis=1)
y = df_final['popularity']
# Ensure all X columns are float (for SHAP and other numeric models)
X = X.astype(float)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Final dataset shape: {X_train.shape}, {X_test.shape}")
# We don't use PCA here as PCA reduces interpretability and Treebased models like Random Forest (which was already our best performing model) doesn't require PCA
Final dataset shape: (495883, 98), (123971, 98)
InĀ [40]:
### **STAGE 1: TRAIN ALL MODELS QUICKLY WITH DEFAULT PARAMETERS**
# Train Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
ridge_y_pred = ridge.predict(X_test)
ridge_rmse = np.sqrt(mean_squared_error(y_test, ridge_y_pred))
ridge_r2 = r2_score(y_test, ridge_y_pred)
# Train Lasso Regression
lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
lasso_y_pred = lasso.predict(X_test)
lasso_rmse = np.sqrt(mean_squared_error(y_test, lasso_y_pred))
lasso_r2 = r2_score(y_test, lasso_y_pred)
# Train Random Forest
random_forest_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
random_forest_model.fit(X_train, y_train)
rf_y_pred = random_forest_model.predict(X_test)
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_y_pred))
rf_r2 = r2_score(y_test, rf_y_pred)
# Train XGBoost
xgb_model = XGBRegressor(n_estimators=100, learning_rate=0.1, eval_metric='rmse', random_state=42)
xgb_model.fit(X_train, y_train)
xgb_y_pred = xgb_model.predict(X_test)
xgb_rmse = np.sqrt(mean_squared_error(y_test, xgb_y_pred))
xgb_r2 = r2_score(y_test, xgb_y_pred)
# not running a Support Vector Regression on the final code because it always performed with worse RMSE and R^2 than our other best models on samples, and it was to computationally intense to run on the full dataset
# Train Support Vector Regression (SVR)
#svr_model = SVR(kernel='rbf', C=100, gamma=0.1)
#svr_model.fit(X_train, y_train)
#svr_y_pred = svr_model.predict(X_test)
#svr_rmse = np.sqrt(mean_squared_error(y_test, svr_y_pred))
#svr_r2 = r2_score(y_test, svr_y_pred)
# Train Gradient Boosting Regressor (GBR)
gbr_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gbr_model.fit(X_train, y_train)
gbr_y_pred = gbr_model.predict(X_test)
gbr_rmse = np.sqrt(mean_squared_error(y_test, gbr_y_pred))
gbr_r2 = r2_score(y_test, gbr_y_pred)
# Evaluate Models
models = {
"Ridge Regression": (ridge_rmse, ridge_r2),
"Lasso Regression": (lasso_rmse, lasso_r2),
"Random Forest": (rf_rmse, rf_r2),
"XGBoost": (xgb_rmse, xgb_r2),
# "Support Vector Regression": (svr_rmse, svr_r2),
"Gradient Boosting Regressor": (gbr_rmse, gbr_r2)
}
for name, (rmse, r2) in models.items():
print(f"{name} RMSE: {rmse:.4f}, R² Score: {r2:.4f}")
# Store actual trained models (so we can access the best one later)
trained_models = {
"Ridge Regression": ridge,
"Lasso Regression": lasso,
"Random Forest": random_forest_model,
"XGBoost": xgb_model,
# "Support Vector Regression": svr_model,
"Gradient Boosting Regressor": gbr_model
}
# Find the best model based on RMSE
best_model_name, (best_rmse, best_r2) = min(models.items(), key=lambda x: x[1][0])
print(f"Best Model: {best_model_name} with RMSE {best_rmse:.4f} and R² {best_r2:.4f}")
Ridge Regression RMSE: 8.8993, R² Score: 0.7138 Lasso Regression RMSE: 8.9590, R² Score: 0.7100 Random Forest RMSE: 7.5418, R² Score: 0.7945 XGBoost RMSE: 8.4738, R² Score: 0.7405 Gradient Boosting Regressor RMSE: 8.8100, R² Score: 0.7195 Best Model: Random Forest with RMSE 7.5418 and R² 0.7945
InĀ [41]:
### **STAGE 2: RUN RANDOMIZED SEARCH ONLY ON THE BEST MODEL** (since the dataset is too large to run on all, especially SVM)
best_model = trained_models[best_model_name] # Get the best model
# Define hyperparameter grids for tuning
param_grids = {
"Ridge Regression": {'alpha': [0.1, 1.0, 10.0, 100.0]},
"Lasso Regression": {'alpha': [0.01, 0.1, 1.0, 10.0]},
"Random Forest": {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]},
"XGBoost": {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 0.2]},
# "Support Vector Regression": {'C': [0.1, 1, 10], 'gamma': ['scale', 'auto']},
"Gradient Boosting Regressor": {'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 0.2]}
}
# Perform hyperparameter tuning only for the best model
print(f"Running RandomizedSearchCV for {best_model_name}...")
best_model_grid = RandomizedSearchCV(
best_model, param_grids[best_model_name],
cv=3, scoring='neg_mean_squared_error', n_iter=10, n_jobs=-1, random_state=42
)
best_model_grid.fit(X_train, y_train)
best_model_tuned = best_model_grid.best_estimator_
# Evaluate the fine-tuned best model
y_pred_tuned = best_model_tuned.predict(X_test)
tuned_rmse = np.sqrt(mean_squared_error(y_test, y_pred_tuned))
tuned_r2 = r2_score(y_test, y_pred_tuned)
print(f"Fine-tuned {best_model_name} RMSE: {tuned_rmse:.4f}, R² Score: {tuned_r2:.4f}")
# Save the fine-tuned best model
model_filename = f"best_model_{best_model_name.replace(' ', '_')}.pkl"
joblib.dump(best_model_tuned, model_filename)
print(f"Saved best model: {model_filename}")
Running RandomizedSearchCV for Random Forest...
c:\Users\samue\Anaconda3\envs\ml1\lib\site-packages\sklearn\model_selection\_search.py:317: UserWarning: The total space of parameters 9 is smaller than n_iter=10. Running 9 iterations. For exhaustive searches, use GridSearchCV.
Fine-tuned Random Forest RMSE: 7.5204, R² Score: 0.7956 Saved best model: best_model_Random_Forest.pkl
InĀ [43]:
### **STAGE 3: SHAP ANALYSIS (ONLY IF BEST MODEL SUPPORTS IT)**
if best_model_name in ["Random Forest", "XGBoost", "Gradient Boosting Regressor"]:
print(f"Running SHAP analysis for {best_model_name}...")
# Load the saved best model
best_model_loaded = joblib.load(model_filename)
# Take a small, consistent sample (30 rows for performance)
sample_X = X_test.sample(30, random_state=42)
# Use TreeExplainer explicitly (optimized for tree-based models)
explainer = shap.TreeExplainer(best_model_loaded, feature_perturbation="interventional")
# Get SHAP values for the sample
shap_values = explainer.shap_values(sample_X, check_additivity=False)
# Initialize JS visualization
shap.initjs()
# Plot SHAP summary
shap.summary_plot(shap_values, sample_X)
else:
print(f"SHAP analysis not supported for {best_model_name}.")
Running SHAP analysis for Random Forest...
Also Trying a Neural Network¶
InĀ [44]:
# Define the Neural Network model
nn_model = Sequential([
Dense(128, activation='relu', input_shape=(X_train.shape[1],)), # Input layer
Dense(64, activation='relu'), # Hidden layer 1
Dense(32, activation='relu'), # Hidden layer 2
Dense(1, activation='linear') # Output layer (linear activation for regression)
])
# Compile the model
nn_model.compile(optimizer=Adam(learning_rate=0.01), loss='mse')
# Train the model
nn_model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=1, validation_split=0.1)
# Make predictions on the test set
nn_y_pred = nn_model.predict(X_test).flatten()
# Calculate RMSE
nn_rmse = np.sqrt(mean_squared_error(y_test, nn_y_pred))
print(f"Neural Network RMSE: {nn_rmse:.4f}")
# Calculate R^2 Score
nn_r2 = r2_score(y_test, nn_y_pred)
print(f"Neural Network R² Score: {nn_r2:.4f}")
c:\Users\samue\Anaconda3\envs\ml1\lib\site-packages\keras\src\layers\core\dense.py:87: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
Epoch 1/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 982us/step - loss: 84.0018 - val_loss: 73.1668 Epoch 2/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 72.9136 - val_loss: 70.7585 Epoch 3/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 72.1529 - val_loss: 70.5403 Epoch 4/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 72.2145 - val_loss: 70.8903 Epoch 5/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 71.1987 - val_loss: 70.7880 Epoch 6/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 71.3719 - val_loss: 70.5414 Epoch 7/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 70.8932 - val_loss: 70.3112 Epoch 8/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 70.8482 - val_loss: 71.5304 Epoch 9/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 70.2784 - val_loss: 73.2248 Epoch 10/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 16s 1ms/step - loss: 70.7604 - val_loss: 69.8628 Epoch 11/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 70.2731 - val_loss: 69.3742 Epoch 12/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 69.8182 - val_loss: 69.9701 Epoch 13/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 70.1548 - val_loss: 69.2275 Epoch 14/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 14s 973us/step - loss: 69.8807 - val_loss: 69.1231 Epoch 15/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 69.6815 - val_loss: 70.3001 Epoch 16/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 70.3314 - val_loss: 68.9895 Epoch 17/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 69.7669 - val_loss: 69.3509 Epoch 18/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 14s 982us/step - loss: 69.5898 - val_loss: 69.2355 Epoch 19/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 14s 1ms/step - loss: 70.3603 - val_loss: 69.2616 Epoch 20/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 16s 1ms/step - loss: 69.9126 - val_loss: 69.0139 Epoch 21/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 69.8703 - val_loss: 69.1261 Epoch 22/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 69.5875 - val_loss: 69.2295 Epoch 23/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 69.5856 - val_loss: 70.4969 Epoch 24/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 14s 1ms/step - loss: 69.2655 - val_loss: 74.1596 Epoch 25/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 69.3082 - val_loss: 69.0217 Epoch 26/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 69.6650 - val_loss: 69.2313 Epoch 27/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 16s 1ms/step - loss: 69.2382 - val_loss: 68.9016 Epoch 28/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 14s 1ms/step - loss: 69.4810 - val_loss: 69.6768 Epoch 29/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 14s 1ms/step - loss: 69.5638 - val_loss: 68.7367 Epoch 30/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 14s 1ms/step - loss: 69.7113 - val_loss: 69.6965 Epoch 31/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 69.8176 - val_loss: 75.0907 Epoch 32/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 13s 948us/step - loss: 69.4239 - val_loss: 69.3025 Epoch 33/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 69.2454 - val_loss: 70.8288 Epoch 34/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 69.4991 - val_loss: 69.0170 Epoch 35/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 16s 1ms/step - loss: 69.2757 - val_loss: 69.4841 Epoch 36/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 16s 1ms/step - loss: 69.6567 - val_loss: 70.0721 Epoch 37/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 69.3887 - val_loss: 69.5970 Epoch 38/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 70.2699 - val_loss: 69.6423 Epoch 39/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 69.6730 - val_loss: 69.0491 Epoch 40/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 68.9219 - val_loss: 71.9004 Epoch 41/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 69.1525 - val_loss: 69.2683 Epoch 42/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 69.2460 - val_loss: 69.2524 Epoch 43/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 16s 1ms/step - loss: 69.5921 - val_loss: 69.9773 Epoch 44/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 16s 1ms/step - loss: 69.3423 - val_loss: 69.1303 Epoch 45/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 69.3062 - val_loss: 72.8272 Epoch 46/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 68.9610 - val_loss: 69.9755 Epoch 47/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 15s 1ms/step - loss: 69.2039 - val_loss: 70.5713 Epoch 48/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 14s 995us/step - loss: 69.1884 - val_loss: 69.8007 Epoch 49/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 16s 1ms/step - loss: 69.2787 - val_loss: 68.5429 Epoch 50/50 13947/13947 āāāāāāāāāāāāāāāāāāāā 16s 1ms/step - loss: 69.0338 - val_loss: 70.3821 3875/3875 āāāāāāāāāāāāāāāāāāāā 1s 369us/step Neural Network RMSE: 8.4163 Neural Network R² Score: 0.7440
And an Ensemble Model¶
InĀ [45]:
import shap
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import StackingRegressor, RandomForestRegressor, GradientBoostingRegressor, BaggingRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Define base models
base_models = [
('bagging', BaggingRegressor(estimator=LinearRegression(), n_estimators=10, random_state=42)),
('random_forest', RandomForestRegressor(n_estimators=50, random_state=42)),
('gbr', GradientBoostingRegressor(n_estimators=50, learning_rate=0.1, random_state=42)),
('xgb', XGBRegressor(n_estimators=50, learning_rate=0.1, random_state=42)),
]
# Define stacking model
stacked_model = StackingRegressor(estimators=base_models, final_estimator=LinearRegression())
stacked_model.fit(X_train, y_train)
# Predict on test data
y_pred = stacked_model.predict(X_test)
# Evaluate model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
n = len(y_test) # Number of samples
k = len(X_train.columns) # Number of predictors
rss = np.sum((y_test - y_pred) ** 2) # Residual sum of squares
aic = n * np.log(rss / n) + 2 * k
aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
print(f"Stacking Model RMSE: {rmse:.4f}")
print(f"Stacking Model R² Score: {r2:.4f}")
print(f"Stacking Model AICc: {aicc:.4f}")
# Store model performance
model_performance = {"Stacking Model": (rmse, r2)}
Stacking Model RMSE: 7.5749 Stacking Model R² Score: 0.7927 Stacking Model AICc: 502239.5700
Model Based on Algorithmic Clusters using PCA¶
We perform clustering on the dataset to group songs with similar characteristics.
This step helps uncover hidden structures and potential mood/genre groupings in the data.
InĀ [46]:
# Import necessary libraries (if not already imported)
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, davies_bouldin_score
# For clustering and PCA, reuse X (it's already numeric and clean)
clustering_features = X.copy()
# Apply PCA on the preprocessed feature set
pca_full = PCA()
pca_full.fit(clustering_features)
# Calculate cumulative explained variance
explained_variance = np.cumsum(pca_full.explained_variance_ratio_)
# Plot explained variance
plt.figure(figsize=(10, 5))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o', linestyle='--')
plt.xlabel("Number of PCA Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("PCA Explained Variance (on X)")
plt.grid()
plt.show()
InĀ [47]:
# Perform PCA on ALL of X
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)
# Store PCA components in a single DataFrame for all data
df_pca = pd.DataFrame(X_pca, columns=[f'PC{i+1}' for i in range(5)])
# Confirm the shape matches your X shape
print(f"PCA transformed data shape: {df_pca.shape}")
PCA transformed data shape: (619854, 5)
InĀ [48]:
# Get PCA loadings
pca_loadings = pd.DataFrame(
pca.components_,
columns=X.columns,
index=[f'PC{i+1}' for i in range(pca.n_components_)]
)
# Create a mask for features that contribute significantly to any component
important_features = pca_loadings.abs().max(axis=0) > 0.1
# Filter the loadings
pca_loadings_filtered = pca_loadings.loc[:, important_features]
# Plot filtered loadings
plt.figure(figsize=(12, 6))
sns.heatmap(
pca_loadings_filtered.abs(),
cmap="viridis",
annot=True,
fmt=".2f",
linewidths=0.5,
cbar_kws={"label": "Contribution (absolute value)"},
annot_kws={"size": 8}
)
plt.xlabel("Original Features", fontsize=12)
plt.ylabel("Principal Components", fontsize=12)
plt.title("Feature Contributions to Principal Components (Filtered Features)", fontsize=16)
plt.xticks(rotation=45, ha='right', fontsize=10)
plt.tight_layout()
plt.show()
InĀ [49]:
# Elbow method to determine the optimal number of clusters (We are using the K-Means Method because the data appears VERY globular/spherical)
inertia = []
K_range = range(2, 10)
for k in K_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(df_pca)
inertia.append(kmeans.inertia_)
# Plot the elbow method graph
plt.figure(figsize=(8, 5))
plt.plot(K_range, inertia, marker="o", linestyle="--", color="b")
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Inertia")
plt.title("Elbow Method for Optimal K")
plt.grid()
plt.show()
InĀ [50]:
# Apply K-Means clustering (the optimal number of clusters appears to be 4 based on the elbow method)
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
# Get cluster labels and assign them to the original dataset
df_final["cluster"] = kmeans.fit_predict(df_pca)
InĀ [51]:
# Add the cluster labels to the PCA dataframe for visualization (optional but clean)
df_pca['cluster'] = df_final['cluster']
# Plot the PCA-reduced data, colored by KMeans cluster assignments
plt.figure(figsize=(12, 8))
sns.scatterplot(
data=df_pca,
x='PC1', y='PC2',
hue='cluster',
palette='Set2',
alpha=0.6, # Adjust alpha for better visibility
s=50 # Marker size
)
plt.title(f'K-Means Clustering (K={optimal_k})')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend(title='Cluster', loc='best')
plt.grid(True)
plt.show()
InĀ [52]:
# Required Imports
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge, Lasso
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import joblib
# Lists to store metrics for each cluster
cluster_rmses = []
cluster_r2s = []
# Iterate over each cluster in the dataset
for cluster_id in df_final['cluster'].unique():
print(f"\nProcessing Cluster {cluster_id}...\n")
# Filter data for the current cluster
cluster_data = df_final[df_final['cluster'] == cluster_id]
# Define features (drop irrelevant columns) and target
X_cluster = cluster_data.drop(columns=['popularity', 'track_id', 'track_name', 'artist_name'])
y_cluster = cluster_data['popularity']
# Ensure features are floats for compatibility with models
X_cluster = X_cluster.astype(float)
# Train-test split for current cluster
X_train, X_test, y_train, y_test = train_test_split(
X_cluster, y_cluster, test_size=0.2, random_state=42
)
print(f"Cluster {cluster_id} dataset shape: {X_train.shape}, {X_test.shape}")
# Define models to evaluate
models = {
'Ridge': Ridge(random_state=42),
'Lasso': Lasso(random_state=42),
'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
'XGBoost': XGBRegressor(n_estimators=100, random_state=42, n_jobs=-1),
'GradientBoosting': GradientBoostingRegressor(n_estimators=100, random_state=42)
}
# Evaluate each model
best_model = None
best_model_name = ''
best_rmse = float('inf')
best_r2 = None
for name, model in models.items():
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"{name} RMSE: {rmse:.4f}, R² Score: {r2:.4f}")
# Select the best model by RMSE
if rmse < best_rmse:
best_rmse = rmse
best_r2 = r2
best_model = model
best_model_name = name
# Save best model for this cluster
model_filename = f"best_model_cluster_{cluster_id}_{best_model_name}.pkl"
joblib.dump(best_model, model_filename)
print(f"\nBest Model for Cluster {cluster_id}: {best_model_name} with RMSE {best_rmse:.4f} and R² {best_r2:.4f}")
print(f"Saved best model for Cluster {cluster_id} as {model_filename}\n")
# Store metrics
cluster_rmses.append(best_rmse)
cluster_r2s.append(best_r2)
# Overall results summary
print("\nSummary of Cluster-Based Models Performance:\n")
for idx, cluster_id in enumerate(df_final['cluster'].unique()):
print(f"Cluster {cluster_id}: RMSE = {cluster_rmses[idx]:.4f}, R² = {cluster_r2s[idx]:.4f}")
Processing Cluster 1... Cluster 1 dataset shape: (243756, 99), (60939, 99) Ridge RMSE: 9.2230, R² Score: 0.7080 Lasso RMSE: 10.0409, R² Score: 0.6540 RandomForest RMSE: 7.8080, R² Score: 0.7908 XGBoost RMSE: 8.5269, R² Score: 0.7505 GradientBoosting RMSE: 9.1024, R² Score: 0.7156 Best Model for Cluster 1: RandomForest with RMSE 7.8080 and R² 0.7908 Saved best model for Cluster 1 as best_model_cluster_1_RandomForest.pkl Processing Cluster 2... Cluster 2 dataset shape: (171664, 99), (42916, 99) Ridge RMSE: 8.7199, R² Score: 0.7178 Lasso RMSE: 9.3707, R² Score: 0.6741 RandomForest RMSE: 7.6629, R² Score: 0.7821 XGBoost RMSE: 8.1252, R² Score: 0.7550 GradientBoosting RMSE: 8.6421, R² Score: 0.7228 Best Model for Cluster 2: RandomForest with RMSE 7.6629 and R² 0.7821 Saved best model for Cluster 2 as best_model_cluster_2_RandomForest.pkl Processing Cluster 0... Cluster 0 dataset shape: (60704, 99), (15176, 99) Ridge RMSE: 8.0694, R² Score: 0.6817 Lasso RMSE: 8.6789, R² Score: 0.6318 RandomForest RMSE: 7.2250, R² Score: 0.7448 XGBoost RMSE: 7.4702, R² Score: 0.7272 GradientBoosting RMSE: 7.9308, R² Score: 0.6925 Best Model for Cluster 0: RandomForest with RMSE 7.2250 and R² 0.7448 Saved best model for Cluster 0 as best_model_cluster_0_RandomForest.pkl Processing Cluster 3... Cluster 3 dataset shape: (19759, 99), (4940, 99) Ridge RMSE: 8.4904, R² Score: 0.6009 Lasso RMSE: 8.9824, R² Score: 0.5534 RandomForest RMSE: 7.7821, R² Score: 0.6648 XGBoost RMSE: 7.9140, R² Score: 0.6533 GradientBoosting RMSE: 8.1353, R² Score: 0.6336 Best Model for Cluster 3: RandomForest with RMSE 7.7821 and R² 0.6648 Saved best model for Cluster 3 as best_model_cluster_3_RandomForest.pkl Summary of Cluster-Based Models Performance: Cluster 1: RMSE = 7.8080, R² = 0.7908 Cluster 2: RMSE = 7.6629, R² = 0.7821 Cluster 0: RMSE = 7.2250, R² = 0.7448 Cluster 3: RMSE = 7.7821, R² = 0.6648
InĀ [53]:
df_final.drop(columns=['cluster'], inplace=True)
Part 4: Modeling using Genre and Mood Clusters¶
InĀ [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
InĀ [4]:
# Load the dataset
df = pd.read_csv('spotify_data.csv')
InĀ [5]:
df.head()
Out[5]:
| Unnamed: 0 | artist_name | track_name | track_id | popularity | year | genre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Jason Mraz | I Won't Give Up | 53QF56cjZA9RTuuMZDrSA6 | 68 | 2012 | acoustic | 0.483 | 0.303 | 4 | -10.058 | 1 | 0.0429 | 0.6940 | 0.000000 | 0.1150 | 0.139 | 133.406 | 240166 | 3 |
| 1 | 1 | Jason Mraz | 93 Million Miles | 1s8tP3jP4GZcyHDsjvw218 | 50 | 2012 | acoustic | 0.572 | 0.454 | 3 | -10.286 | 1 | 0.0258 | 0.4770 | 0.000014 | 0.0974 | 0.515 | 140.182 | 216387 | 4 |
| 2 | 2 | Joshua Hyslop | Do Not Let Me Go | 7BRCa8MPiyuvr2VU3O9W0F | 57 | 2012 | acoustic | 0.409 | 0.234 | 3 | -13.711 | 1 | 0.0323 | 0.3380 | 0.000050 | 0.0895 | 0.145 | 139.832 | 158960 | 4 |
| 3 | 3 | Boyce Avenue | Fast Car | 63wsZUhUZLlh1OsyrZq7sz | 58 | 2012 | acoustic | 0.392 | 0.251 | 10 | -9.845 | 1 | 0.0363 | 0.8070 | 0.000000 | 0.0797 | 0.508 | 204.961 | 304293 | 4 |
| 4 | 4 | Andrew Belle | Sky's Still Blue | 6nXIYClvJAfi6ujLiKqEq8 | 54 | 2012 | acoustic | 0.430 | 0.791 | 6 | -5.419 | 0 | 0.0302 | 0.0726 | 0.019300 | 0.1100 | 0.217 | 171.864 | 244320 | 4 |
InĀ [6]:
# Display basic information
print(f"Dataset shape: {df.shape}")
Dataset shape: (1159764, 20)
InĀ [7]:
# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())
Missing values: Unnamed: 0 0 artist_name 15 track_name 1 track_id 0 popularity 0 year 0 genre 0 danceability 0 energy 0 key 0 loudness 0 mode 0 speechiness 0 acousticness 0 instrumentalness 0 liveness 0 valence 0 tempo 0 duration_ms 0 time_signature 0 dtype: int64
InĀ [8]:
# Simple way to drop rows with missing values
df = df.dropna()
print(f"Shape after dropping rows with missing values: {df.shape} -> {df.shape}")
Shape after dropping rows with missing values: (1159748, 20) -> (1159748, 20)
InĀ [9]:
df = df.drop(columns=['track_id', 'track_name', 'artist_name'])
InĀ [10]:
df = df.drop(columns=['Unnamed: 0'])
Let's classify our songs by mood¶
Mood Classification Approach¶
Framework Used: Thayer/Russell model of musical emotion
Primary Dimensions:
- Valence (musical positiveness/happiness)
- Energy (intensity/arousal)
Threshold Selection:
- Used standard midpoint of 0.5 for both dimensions
- Creates four intuitive mood quadrants
- Aligns with normalized Spotify audio features (0-1 scale)
Mood Categories:
- Happy/Energetic: High valence (ā„0.5) + High energy (ā„0.5)
- Peaceful/Relaxed: High valence (ā„0.5) + Low energy (<0.5)
- Angry/Tense: Low valence (<0.5) + High energy (ā„0.5)
- Sad/Depressed: Low valence (<0.5) + Low energy (<0.5)
Justification:
- Grounded in established music psychology research
- Provides clear, interpretable categories
- Enables direct comparison of popularity prediction models across mood types
- Avoids arbitrary thresholds by using standard midpoints
InĀ [11]:
# Step 1: Function to assign moods based on valence and energy
def assign_mood(row):
"""Assign mood label based on Thayer model thresholds"""
if row['valence'] >= 0.5 and row['energy'] >= 0.5:
return "Happy/Energetic" # High valence, high energy
elif row['valence'] >= 0.5 and row['energy'] < 0.5:
return "Peaceful/Relaxed" # High valence, low energy
elif row['valence'] < 0.5 and row['energy'] >= 0.5:
return "Angry/Tense" # Low valence, high energy
else:
return "Sad/Depressed" # Low valence, low energy
InĀ [12]:
# Step 2: Apply the mood classification to the dataframe
def classify_songs_by_mood(df):
"""Apply mood classification to songs based on audio features"""
# Create a copy to avoid modifying the original
df_moods = df.copy()
# Apply the mood assignment function
df_moods['mood_label'] = df_moods.apply(assign_mood, axis=1)
return df_moods
InĀ [13]:
df_with_moods = classify_songs_by_mood(df)
InĀ [14]:
# Step 3: Analyze the mood distribution
def analyze_mood_distribution(df_moods):
"""Analyze and display mood distribution statistics"""
# Display the distribution of moods
print("Mood distribution:")
mood_counts = df_moods['mood_label'].value_counts()
print(mood_counts)
# Calculate percentages
mood_percentages = df_moods['mood_label'].value_counts(normalize=True) * 100
print("\nPercentage distribution:")
for mood, percentage in mood_percentages.items():
print(f"{mood}: {percentage:.1f}%")
# Look at average audio features by mood
mood_features = df_moods.groupby('mood_label')[
['valence', 'energy', 'danceability', 'acousticness',
'instrumentalness', 'tempo', 'loudness']
].mean()
print("\nAverage audio features by mood:")
print(mood_features)
return mood_counts, mood_features
InĀ [15]:
# Then analyze the mood distribution
mood_counts, mood_features = analyze_mood_distribution(df_with_moods)
Mood distribution:
mood_label
Happy/Energetic 415170
Angry/Tense 406532
Sad/Depressed 251438
Peaceful/Relaxed 86608
Name: count, dtype: int64
Percentage distribution:
Happy/Energetic: 35.8%
Angry/Tense: 35.1%
Sad/Depressed: 21.7%
Peaceful/Relaxed: 7.5%
Average audio features by mood:
valence energy danceability acousticness \
mood_label
Angry/Tense 0.274559 0.800799 0.492348 0.115178
Happy/Energetic 0.727080 0.772064 0.627835 0.219054
Peaceful/Relaxed 0.679698 0.359201 0.621856 0.630243
Sad/Depressed 0.222715 0.257170 0.432022 0.718072
instrumentalness tempo loudness
mood_label
Angry/Tense 0.288763 126.243263 -6.650786
Happy/Energetic 0.138861 124.472668 -6.578804
Peaceful/Relaxed 0.225239 117.071605 -12.355621
Sad/Depressed 0.390171 109.882856 -15.553975
InĀ [16]:
# Step 4: Visualize the mood classification
def visualize_mood_classification(df_moods):
"""Create visualization of mood classification"""
# Create a visualization
plt.figure(figsize=(12, 8))
# Define colors for each mood
colors = {'Happy/Energetic': 'gold',
'Peaceful/Relaxed': 'lightgreen',
'Angry/Tense': 'red',
'Sad/Depressed': 'blue'}
# Sample to avoid overcrowding the plot
sample = df_moods.sample(min(5000, len(df_moods)))
# Plot each mood category
for mood, color in colors.items():
mood_data = sample[sample['mood_label'] == mood]
plt.scatter(mood_data['valence'], mood_data['energy'],
color=color, label=mood, alpha=0.6)
# Add dividing lines at 0.5 for both axes
plt.axvline(x=0.5, color='gray', linestyle='--', alpha=0.5)
plt.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
# Add quadrant labels
plt.text(0.25, 0.75, "Angry/Tense", horizontalalignment='center', fontsize=12)
plt.text(0.75, 0.75, "Happy/Energetic", horizontalalignment='center', fontsize=12)
plt.text(0.25, 0.25, "Sad/Depressed", horizontalalignment='center', fontsize=12)
plt.text(0.75, 0.25, "Peaceful/Relaxed", horizontalalignment='center', fontsize=12)
plt.title('Song Mood Classification Based on Thayer Model', fontsize=16)
plt.xlabel('Valence (Musical Positiveness)', fontsize=14)
plt.ylabel('Energy', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Step 5: Main function to run the entire classification process
def categorize_songs_by_mood(df):
"""Main function to run the mood classification process"""
# Apply the classification
df_moods = classify_songs_by_mood(df)
# Analyze the distribution
analyze_mood_distribution(df_moods)
# Create visualization
visualize_mood_classification(df_moods)
return df_moods
InĀ [17]:
# Finally visualize the classification
visualize_mood_classification(df_with_moods)
InĀ [18]:
def analyze_genre_by_mood(df_with_moods, top_n=5):
"""
Analyze and visualize the distribution of top genres within each mood category
Parameters:
df_with_moods (pandas.DataFrame): DataFrame with mood classification
top_n (int): Number of top genres to show for each mood
"""
# Create a figure with subplots for each mood
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()
# Define colors for each mood
mood_colors = {
'Happy/Energetic': 'gold',
'Peaceful/Relaxed': 'lightgreen',
'Angry/Tense': 'red',
'Sad/Depressed': 'blue'
}
# Get all moods
moods = df_with_moods['mood_label'].unique()
# For each mood, plot the top genres
for i, mood in enumerate(sorted(moods)):
# Filter for this mood
mood_data = df_with_moods[df_with_moods['mood_label'] == mood]
# Get top genres for this mood
top_genres = mood_data['genre'].value_counts().head(top_n)
# Plot
axes[i].bar(top_genres.index, top_genres.values, color=mood_colors[mood], alpha=0.7)
axes[i].set_title(f'Top {top_n} Genres in {mood} Category', fontsize=14)
axes[i].set_ylabel('Number of Songs', fontsize=12)
axes[i].tick_params(axis='x', rotation=45)
axes[i].grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
# Also create a table showing percentage of each genre in each mood
print(f"Top {top_n} genres in each mood category:")
for mood in sorted(moods):
mood_data = df_with_moods[df_with_moods['mood_label'] == mood]
top_genres = mood_data['genre'].value_counts().head(top_n)
total = len(mood_data)
print(f"\n{mood}:")
for genre, count in top_genres.items():
percentage = (count / total) * 100
print(f" {genre}: {count} songs ({percentage:.1f}%)")
# Call the function to analyze genre distribution by mood
analyze_genre_by_mood(df_with_moods)
Top 5 genres in each mood category: Angry/Tense: black-metal: 19265 songs (4.7%) death-metal: 16803 songs (4.1%) grindcore: 13251 songs (3.3%) heavy-metal: 11299 songs (2.8%) emo: 11295 songs (2.8%) Happy/Energetic: forro: 17693 songs (4.3%) salsa: 15234 songs (3.7%) dancehall: 13965 songs (3.4%) samba: 11920 songs (2.9%) sertanejo: 11557 songs (2.8%) Peaceful/Relaxed: tango: 8482 songs (9.8%) guitar: 3762 songs (4.3%) comedy: 3648 songs (4.2%) rock-n-roll: 3602 songs (4.2%) acoustic: 3231 songs (3.7%) Sad/Depressed: ambient: 17281 songs (6.9%) new-age: 16940 songs (6.7%) sleep: 14116 songs (5.6%) classical: 13612 songs (5.4%) opera: 12483 songs (5.0%)
InĀ [19]:
pd.set_option('display.max_columns', None)
df_with_moods.head()
Out[19]:
| popularity | year | genre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature | mood_label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68 | 2012 | acoustic | 0.483 | 0.303 | 4 | -10.058 | 1 | 0.0429 | 0.6940 | 0.000000 | 0.1150 | 0.139 | 133.406 | 240166 | 3 | Sad/Depressed |
| 1 | 50 | 2012 | acoustic | 0.572 | 0.454 | 3 | -10.286 | 1 | 0.0258 | 0.4770 | 0.000014 | 0.0974 | 0.515 | 140.182 | 216387 | 4 | Peaceful/Relaxed |
| 2 | 57 | 2012 | acoustic | 0.409 | 0.234 | 3 | -13.711 | 1 | 0.0323 | 0.3380 | 0.000050 | 0.0895 | 0.145 | 139.832 | 158960 | 4 | Sad/Depressed |
| 3 | 58 | 2012 | acoustic | 0.392 | 0.251 | 10 | -9.845 | 1 | 0.0363 | 0.8070 | 0.000000 | 0.0797 | 0.508 | 204.961 | 304293 | 4 | Peaceful/Relaxed |
| 4 | 54 | 2012 | acoustic | 0.430 | 0.791 | 6 | -5.419 | 0 | 0.0302 | 0.0726 | 0.019300 | 0.1100 | 0.217 | 171.864 | 244320 | 4 | Angry/Tense |
Modeling and Comparison with the "Base" Model¶
InĀ [20]:
print("Columns in df before dropping 'year':")
print(df.columns)
if 'year' in df.columns:
df.drop(columns=['year'], inplace=True, errors='ignore')
print("'year' column dropped successfully.")
else:
print("'year' column is not present in the DataFrame.")
Columns in df before dropping 'year':
Index(['popularity', 'year', 'genre', 'danceability', 'energy', 'key',
'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature'],
dtype='object')
'year' column dropped successfully.
InĀ [21]:
import pandas as pd
import numpy as np
import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
#############################################################################
# 1. Load and Prepare the Full Dataset
#############################################################################
df_full = df_with_moods.copy()
print(f"Full dataset size: {len(df_full):,} rows")
# Take a 15% random sample
sample_fraction = 0.15
df = df_full.sample(frac=sample_fraction, random_state=42)
print(f"Sampled dataset size (10%): {len(df):,} rows")
# If you want to drop 'artist_popularity' or other columns:
if 'artist_popularity' in df.columns:
df.drop(columns=['artist_popularity'], inplace=True, errors='ignore')
# (Optional) If you WANT to drop 'year':
# if 'year' in df.columns:
# df.drop(columns=['year'], inplace=True, errors='ignore')
print("Columns before encoding:", df.columns.tolist())
# 2. Encode 'genre' + 'mood_label'
# - For 'genre', we can use drop_first=True to avoid the dummy trap
# - For 'mood_label', keep all dummies (drop_first=False)
df = pd.get_dummies(df, columns=['genre'], drop_first=True)
df = pd.get_dummies(df, columns=['mood_label'], prefix='mood', drop_first=False)
print("Columns after encoding:", df.columns.tolist())
#############################################################################
# 3. Scale Numerical Features
#############################################################################
numerical_columns = [
'year', 'danceability', 'energy', 'key', 'loudness',
'mode', 'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature'
]
scaler = StandardScaler()
# Only scale columns that actually exist in df
for col in numerical_columns:
if col in df.columns:
df[col] = scaler.fit_transform(df[[col]])
print("Scaling complete.")
#############################################################################
# 4. Train/Test Split for the General Model (All Rows)
#############################################################################
X = df.drop(columns=['popularity','track_id','track_name','artist_name'], errors='ignore')
y = df['popularity']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
general_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
general_model.fit(X_train, y_train)
# Evaluate
general_preds = general_model.predict(X_test)
general_rmse = np.sqrt(mean_squared_error(y_test, general_preds))
general_r2 = r2_score(y_test, general_preds)
print(f"\n=== General Model (All Data) ===")
print(f"RMSE: {general_rmse:.4f}")
print(f"R²: {general_r2:.4f}")
# Save the general model
joblib.dump(general_model, "model_general_all_data.pkl")
print("Saved: 'model_general_all_data.pkl'")
#############################################################################
# 5. Train One Model per Mood Using the Already-Transformed Data
#############################################################################
moods = df_with_moods['mood_label'].unique()
mood_models = {}
mood_metrics = []
for mood in moods:
# Because we used get_dummies(mood_label, drop_first=False, prefix='mood'),
# the dummy column is "mood_<mood>" as it appears in the original df.
# Replace any spaces or slashes in the mood string to match how get_dummies named it.
# For example, a mood "Angry/Tense" typically becomes "mood_Angry/Tense".
# But if your mood has spaces, it might become "mood_Angry_Tense", etc.
# Let's see how pandas actually named them:
dummy_col = f"mood_{mood}"
# If your mood has special characters, you might need to confirm the exact naming in df.columns
if dummy_col not in df.columns:
print(f"Skipping mood '{mood}' because '{dummy_col}' not found in columns.")
continue
# Filter the entire DF for this mood
df_mood = df[df[dummy_col] == 1].copy()
# If there's not enough data, skip it
if len(df_mood) < 50:
print(f"Skipping mood '{mood}' -> only {len(df_mood)} rows.")
continue
X_mood = df_mood.drop(columns=['popularity','track_id','track_name','artist_name'], errors='ignore')
y_mood = df_mood['popularity']
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(
X_mood, y_mood, test_size=0.2, random_state=42
)
# Train
mood_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
mood_model.fit(X_train_m, y_train_m)
# Evaluate
mood_preds = mood_model.predict(X_test_m)
mood_rmse = np.sqrt(mean_squared_error(y_test_m, mood_preds))
mood_r2 = r2_score(y_test_m, mood_preds)
print(f"\n=== Mood '{mood}' Model ===")
print(f"RMSE: {mood_rmse:.4f}")
print(f"R²: {mood_r2:.4f}")
# Save
safe_mood_name = mood.replace("/", "_").replace(" ", "_") # Clean up for filename
model_filename = f"model_mood_{safe_mood_name}.pkl"
joblib.dump(mood_model, model_filename)
print(f"Saved: '{model_filename}'")
# Store
mood_models[mood] = mood_model
mood_metrics.append({
'mood': mood,
'rmse': mood_rmse,
'r2': mood_r2,
'n_samples': len(df_mood)
})
#############################################################################
# 6. Summary of All Mood Models
#############################################################################
print("\n=== Summary of Mood-Specific Models ===")
for metric in mood_metrics:
print(f"Mood: {metric['mood']}, N={metric['n_samples']} | RMSE={metric['rmse']:.4f}, R²={metric['r2']:.4f}")
Full dataset size: 1,159,748 rows Sampled dataset size (10%): 173,962 rows Columns before encoding: ['popularity', 'year', 'genre', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature', 'mood_label'] Columns after encoding: ['popularity', 'year', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature', 'genre_afrobeat', 'genre_alt-rock', 'genre_ambient', 'genre_black-metal', 'genre_blues', 'genre_breakbeat', 'genre_cantopop', 'genre_chicago-house', 'genre_chill', 'genre_classical', 'genre_club', 'genre_comedy', 'genre_country', 'genre_dance', 'genre_dancehall', 'genre_death-metal', 'genre_deep-house', 'genre_detroit-techno', 'genre_disco', 'genre_drum-and-bass', 'genre_dub', 'genre_dubstep', 'genre_edm', 'genre_electro', 'genre_electronic', 'genre_emo', 'genre_folk', 'genre_forro', 'genre_french', 'genre_funk', 'genre_garage', 'genre_german', 'genre_gospel', 'genre_goth', 'genre_grindcore', 'genre_groove', 'genre_guitar', 'genre_hard-rock', 'genre_hardcore', 'genre_hardstyle', 'genre_heavy-metal', 'genre_hip-hop', 'genre_house', 'genre_indian', 'genre_indie-pop', 'genre_industrial', 'genre_jazz', 'genre_k-pop', 'genre_metal', 'genre_metalcore', 'genre_minimal-techno', 'genre_new-age', 'genre_opera', 'genre_party', 'genre_piano', 'genre_pop', 'genre_pop-film', 'genre_power-pop', 'genre_progressive-house', 'genre_psych-rock', 'genre_punk', 'genre_punk-rock', 'genre_rock', 'genre_rock-n-roll', 'genre_romance', 'genre_sad', 'genre_salsa', 'genre_samba', 'genre_sertanejo', 'genre_show-tunes', 'genre_singer-songwriter', 'genre_ska', 'genre_sleep', 'genre_songwriter', 'genre_soul', 'genre_spanish', 'genre_swedish', 'genre_tango', 'genre_techno', 'genre_trance', 'genre_trip-hop', 'mood_Angry/Tense', 'mood_Happy/Energetic', 'mood_Peaceful/Relaxed', 'mood_Sad/Depressed'] Scaling complete. === General Model (All Data) === RMSE: 9.4912 R²: 0.6397 Saved: 'model_general_all_data.pkl' === Mood 'Sad/Depressed' Model === RMSE: 9.8869 R²: 0.6088 Saved: 'model_mood_Sad_Depressed.pkl' === Mood 'Peaceful/Relaxed' Model === RMSE: 9.8788 R²: 0.5763 Saved: 'model_mood_Peaceful_Relaxed.pkl' === Mood 'Angry/Tense' Model === RMSE: 9.4975 R²: 0.6335 Saved: 'model_mood_Angry_Tense.pkl' === Mood 'Happy/Energetic' Model === RMSE: 10.4103 R²: 0.5927 Saved: 'model_mood_Happy_Energetic.pkl' === Summary of Mood-Specific Models === Mood: Sad/Depressed, N=37874 | RMSE=9.8869, R²=0.6088 Mood: Peaceful/Relaxed, N=13080 | RMSE=9.8788, R²=0.5763 Mood: Angry/Tense, N=60497 | RMSE=9.4975, R²=0.6335 Mood: Happy/Energetic, N=62511 | RMSE=10.4103, R²=0.5927
Genre + Mood Clustering¶
InĀ [22]:
import pandas as pd
import numpy as np
import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
#############################################################################
# 1. Load Your Data and Identify Top 5 Genres
#############################################################################
df = df_with_moods.copy()
# (Optional) Drop columns not needed:
for col in ['artist_popularity']: # or 'year' if you want to drop it
if col in df.columns:
df.drop(columns=[col], inplace=True, errors='ignore')
# Suppose these are your 4 moods:
# If your dataset has these exact 4, confirm strings match what's in your data
all_moods = ["Happy/Energetic", "Angry/Tense", "Peaceful/Relaxed", "Sad/Depressed"]
# Count genres and pick top 5
top_5_genres = df['genre'].value_counts().head(5).index.tolist()
print("Top 5 genres:", top_5_genres)
# Filter dataset to only those top 5 genres
df_top5 = df[df['genre'].isin(top_5_genres)].copy()
print(f"Filtered to top-5 genres: {len(df_top5):,} rows total.")
#############################################################################
# 2. Encode Genre & Mood in a Single Pass
#############################################################################
# We'll include ALL mood labels (not dropping the first), so each mood has a column.
# For genre, you can choose drop_first=True or False. If you set True, you'll have
# one fewer column. But let's set drop_first=False so each of the top 5 has its own dummy.
df_top5 = pd.get_dummies(df_top5, columns=['genre'], drop_first=False)
df_top5 = pd.get_dummies(df_top5, columns=['mood_label'], prefix='mood', drop_first=False)
print("After encoding, columns:", df_top5.columns.tolist())
#############################################################################
# 3. Scale Numeric Columns
#############################################################################
numerical_columns = [
'year', 'danceability', 'energy', 'key', 'loudness', 'mode',
'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature'
]
scaler = StandardScaler()
for col in numerical_columns:
if col in df_top5.columns:
df_top5[col] = scaler.fit_transform(df_top5[[col]])
#############################################################################
# 4. (Optional) Train a "General" Model on This Top-5-Genre Dataset
#############################################################################
X_general = df_top5.drop(columns=['popularity','track_id','track_name','artist_name'], errors='ignore')
y_general = df_top5['popularity']
X_train_g, X_test_g, y_train_g, y_test_g = train_test_split(X_general, y_general, test_size=0.2, random_state=42)
general_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
general_model.fit(X_train_g, y_train_g)
gen_preds = general_model.predict(X_test_g)
gen_rmse = np.sqrt(mean_squared_error(y_test_g, gen_preds))
gen_r2 = r2_score(y_test_g, gen_preds)
print("\n=== General Model (Top-5 Genres) ===")
print(f"RMSE: {gen_rmse:.4f}, R²: {gen_r2:.4f}")
joblib.dump(general_model, "model_general_top5.pkl")
print("Saved model_general_top5.pkl\n")
#############################################################################
# 5. Genre + Mood Submodels
#############################################################################
genre_mood_metrics = []
genre_mood_models = {}
for genre in top_5_genres:
# The dummy column for this genre is 'genre_<genre>'
genre_col = f"genre_{genre}"
# Check if that column exists
if genre_col not in df_top5.columns:
print(f"Skipping genre '{genre}' -> no dummy column found.")
continue
for mood in all_moods:
# The dummy column for this mood is 'mood_<mood>'
# But be mindful if your mood name has spaces/slashes;
# For example, "Angry/Tense" -> "mood_Angry/Tense" or "mood_Angry_Tense"
# Let's see exactly how get_dummies named it. Usually it uses underscores for spaces.
# Actually, let's check the real columns in df_top5 to confirm.
# For demonstration, let's assume it becomes "mood_Angry/Tense".
mood_col = f"mood_{mood}"
if mood_col not in df_top5.columns:
print(f"Skipping mood '{mood}' -> no dummy column named '{mood_col}' found.")
continue
# Filter for the combination
df_subset = df_top5[(df_top5[genre_col] == 1) & (df_top5[mood_col] == 1)].copy()
n_rows = len(df_subset)
if n_rows < 50:
print(f"Skipping combo (genre='{genre}', mood='{mood}') -> only {n_rows} rows.")
continue
# Build features & target
X_sub = df_subset.drop(columns=['popularity','track_id','track_name','artist_name'], errors='ignore')
y_sub = df_subset['popularity']
# Train/test split
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_sub, y_sub, test_size=0.2, random_state=42)
# Train a model
combo_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
combo_model.fit(X_train_s, y_train_s)
# Evaluate
preds_s = combo_model.predict(X_test_s)
rmse_s = np.sqrt(mean_squared_error(y_test_s, preds_s))
r2_s = r2_score(y_test_s, preds_s)
# Save
# Make a "safe" filename in case your genre or mood strings have slashes/spaces
safe_genre = genre.replace("/", "_").replace(" ", "_")
safe_mood = mood.replace("/", "_").replace(" ", "_")
model_filename = f"model_genre_{safe_genre}_mood_{safe_mood}.pkl"
joblib.dump(combo_model, model_filename)
print(f"Trained combo: (Genre='{genre}', Mood='{mood}') -> N={n_rows}, RMSE={rmse_s:.4f}, R²={r2_s:.4f}")
print(f"Saved: {model_filename}\n")
# Store results
genre_mood_models[(genre, mood)] = combo_model
genre_mood_metrics.append({
'genre': genre,
'mood': mood,
'n_rows': n_rows,
'rmse': rmse_s,
'r2': r2_s
})
#############################################################################
# 6. Summary of Genre+Mood Models
#############################################################################
print("\n=== Summary of (Genre, Mood) Models ===")
for gm in genre_mood_metrics:
print(f"Genre='{gm['genre']}', Mood='{gm['mood']}', N={gm['n_rows']}, "
f"RMSE={gm['rmse']:.4f}, R²={gm['r2']:.4f}")
Top 5 genres: ['black-metal', 'gospel', 'ambient', 'acoustic', 'alt-rock'] Filtered to top-5 genres: 106,862 rows total. After encoding, columns: ['popularity', 'year', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature', 'genre_acoustic', 'genre_alt-rock', 'genre_ambient', 'genre_black-metal', 'genre_gospel', 'mood_Angry/Tense', 'mood_Happy/Energetic', 'mood_Peaceful/Relaxed', 'mood_Sad/Depressed'] === General Model (Top-5 Genres) === RMSE: 9.2960, R²: 0.5951 Saved model_general_top5.pkl Trained combo: (Genre='black-metal', Mood='Happy/Energetic') -> N=614, RMSE=8.1305, R²=0.0936 Saved: model_genre_black-metal_mood_Happy_Energetic.pkl Trained combo: (Genre='black-metal', Mood='Angry/Tense') -> N=19265, RMSE=7.5115, R²=0.2530 Saved: model_genre_black-metal_mood_Angry_Tense.pkl Skipping combo (genre='black-metal', mood='Peaceful/Relaxed') -> only 30 rows. Trained combo: (Genre='black-metal', Mood='Sad/Depressed') -> N=1928, RMSE=7.1810, R²=0.2695 Saved: model_genre_black-metal_mood_Sad_Depressed.pkl Trained combo: (Genre='gospel', Mood='Happy/Energetic') -> N=7527, RMSE=8.6510, R²=0.1454 Saved: model_genre_gospel_mood_Happy_Energetic.pkl Trained combo: (Genre='gospel', Mood='Angry/Tense') -> N=6137, RMSE=11.0252, R²=0.2640 Saved: model_genre_gospel_mood_Angry_Tense.pkl Trained combo: (Genre='gospel', Mood='Peaceful/Relaxed') -> N=1634, RMSE=7.7421, R²=0.0722 Saved: model_genre_gospel_mood_Peaceful_Relaxed.pkl Trained combo: (Genre='gospel', Mood='Sad/Depressed') -> N=6323, RMSE=10.3212, R²=0.2108 Saved: model_genre_gospel_mood_Sad_Depressed.pkl Trained combo: (Genre='ambient', Mood='Happy/Energetic') -> N=694, RMSE=8.6806, R²=0.5388 Saved: model_genre_ambient_mood_Happy_Energetic.pkl Trained combo: (Genre='ambient', Mood='Angry/Tense') -> N=2752, RMSE=8.8545, R²=0.5428 Saved: model_genre_ambient_mood_Angry_Tense.pkl Trained combo: (Genre='ambient', Mood='Peaceful/Relaxed') -> N=662, RMSE=8.9283, R²=0.4806 Saved: model_genre_ambient_mood_Peaceful_Relaxed.pkl Trained combo: (Genre='ambient', Mood='Sad/Depressed') -> N=17281, RMSE=9.4974, R²=0.4932 Saved: model_genre_ambient_mood_Sad_Depressed.pkl Trained combo: (Genre='acoustic', Mood='Happy/Energetic') -> N=4276, RMSE=10.3877, R²=0.2886 Saved: model_genre_acoustic_mood_Happy_Energetic.pkl Trained combo: (Genre='acoustic', Mood='Angry/Tense') -> N=3268, RMSE=10.3847, R²=0.2644 Saved: model_genre_acoustic_mood_Angry_Tense.pkl Trained combo: (Genre='acoustic', Mood='Peaceful/Relaxed') -> N=3231, RMSE=9.8868, R²=0.4829 Saved: model_genre_acoustic_mood_Peaceful_Relaxed.pkl Trained combo: (Genre='acoustic', Mood='Sad/Depressed') -> N=10322, RMSE=10.2575, R²=0.4241 Saved: model_genre_acoustic_mood_Sad_Depressed.pkl Trained combo: (Genre='alt-rock', Mood='Happy/Energetic') -> N=7867, RMSE=10.3670, R²=0.1360 Saved: model_genre_alt-rock_mood_Happy_Energetic.pkl Trained combo: (Genre='alt-rock', Mood='Angry/Tense') -> N=10062, RMSE=9.7566, R²=0.0406 Saved: model_genre_alt-rock_mood_Angry_Tense.pkl Trained combo: (Genre='alt-rock', Mood='Peaceful/Relaxed') -> N=529, RMSE=11.5521, R²=0.0642 Saved: model_genre_alt-rock_mood_Peaceful_Relaxed.pkl Trained combo: (Genre='alt-rock', Mood='Sad/Depressed') -> N=2460, RMSE=9.1665, R²=-0.0026 Saved: model_genre_alt-rock_mood_Sad_Depressed.pkl === Summary of (Genre, Mood) Models === Genre='black-metal', Mood='Happy/Energetic', N=614, RMSE=8.1305, R²=0.0936 Genre='black-metal', Mood='Angry/Tense', N=19265, RMSE=7.5115, R²=0.2530 Genre='black-metal', Mood='Sad/Depressed', N=1928, RMSE=7.1810, R²=0.2695 Genre='gospel', Mood='Happy/Energetic', N=7527, RMSE=8.6510, R²=0.1454 Genre='gospel', Mood='Angry/Tense', N=6137, RMSE=11.0252, R²=0.2640 Genre='gospel', Mood='Peaceful/Relaxed', N=1634, RMSE=7.7421, R²=0.0722 Genre='gospel', Mood='Sad/Depressed', N=6323, RMSE=10.3212, R²=0.2108 Genre='ambient', Mood='Happy/Energetic', N=694, RMSE=8.6806, R²=0.5388 Genre='ambient', Mood='Angry/Tense', N=2752, RMSE=8.8545, R²=0.5428 Genre='ambient', Mood='Peaceful/Relaxed', N=662, RMSE=8.9283, R²=0.4806 Genre='ambient', Mood='Sad/Depressed', N=17281, RMSE=9.4974, R²=0.4932 Genre='acoustic', Mood='Happy/Energetic', N=4276, RMSE=10.3877, R²=0.2886 Genre='acoustic', Mood='Angry/Tense', N=3268, RMSE=10.3847, R²=0.2644 Genre='acoustic', Mood='Peaceful/Relaxed', N=3231, RMSE=9.8868, R²=0.4829 Genre='acoustic', Mood='Sad/Depressed', N=10322, RMSE=10.2575, R²=0.4241 Genre='alt-rock', Mood='Happy/Energetic', N=7867, RMSE=10.3670, R²=0.1360 Genre='alt-rock', Mood='Angry/Tense', N=10062, RMSE=9.7566, R²=0.0406 Genre='alt-rock', Mood='Peaceful/Relaxed', N=529, RMSE=11.5521, R²=0.0642 Genre='alt-rock', Mood='Sad/Depressed', N=2460, RMSE=9.1665, R²=-0.0026
InĀ [Ā ]:
InĀ [23]:
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
#############################################################################
# 1. Filter Data to Top 5 Genres
#############################################################################
df = df_with_moods.copy()
# (Optional) drop columns you don't want
for col in ["artist_popularity"]:
if col in df.columns:
df.drop(columns=[col], inplace=True, errors='ignore')
# Identify top 5 genres
top_5_genres = df['genre'].value_counts().head(5).index.tolist()
print("Top 5 genres:", top_5_genres)
df_top5 = df[df['genre'].isin(top_5_genres)].copy()
print(f"Filtered to top-5 genres: {len(df_top5):,} rows.")
#############################################################################
# 2. Encode 'genre' + 'mood_label', Scale Numeric Columns
#############################################################################
df_top5 = pd.get_dummies(df_top5, columns=['genre'], drop_first=False)
df_top5 = pd.get_dummies(df_top5, columns=['mood_label'], prefix='mood', drop_first=False)
# List numeric columns to scale
numerical_columns = [
'year', 'danceability', 'energy', 'key', 'loudness',
'mode', 'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature'
]
scaler = StandardScaler()
for col in numerical_columns:
if col in df_top5.columns:
df_top5[col] = scaler.fit_transform(df_top5[[col]])
print("Encoding & scaling complete. Columns now:", df_top5.columns.tolist())
#############################################################################
# 3. Prepare Features & Target
#############################################################################
X = df_top5.drop(columns=['popularity','track_id','track_name','artist_name'], errors='ignore')
y = df_top5['popularity']
# (Optional) remove any leftover indexing columns
for c in ['Unnamed: 0']:
if c in X.columns:
X.drop(columns=[c], inplace=True)
print("Feature matrix size:", X.shape)
#############################################################################
# 4. K-Fold Cross-Validation on the Entire Top-5-Genre Dataset
#############################################################################
def cross_validate_rf(X, y, n_splits=5, random_seed=42):
kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_seed)
r2_scores = []
rmse_scores = []
for train_idx, valid_idx in kf.split(X):
X_train_fold = X.iloc[train_idx]
X_valid_fold = X.iloc[valid_idx]
y_train_fold = y.iloc[train_idx]
y_valid_fold = y.iloc[valid_idx]
# Train
rf = RandomForestRegressor(n_estimators=100, random_state=random_seed, n_jobs=-1)
rf.fit(X_train_fold, y_train_fold)
# Predict
preds_fold = rf.predict(X_valid_fold)
# Metrics
fold_r2 = r2_score(y_valid_fold, preds_fold)
fold_rmse = np.sqrt(mean_squared_error(y_valid_fold, preds_fold))
r2_scores.append(fold_r2)
rmse_scores.append(fold_rmse)
return np.mean(r2_scores), np.mean(rmse_scores)
avg_r2, avg_rmse = cross_validate_rf(X, y, n_splits=5)
print(f"\n=== K-Fold CV Results (Top-5 Genres) ===")
print(f"Average R² (5 folds): {avg_r2:.4f}")
print(f"Average RMSE (5 folds): {avg_rmse:.4f}")
#############################################################################
# 5. Final Model on ALL Data + Feature Importances
#############################################################################
final_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
final_model.fit(X, y)
# Evaluate on the same data (just to see how it fits overall)
preds_all = final_model.predict(X)
final_r2 = r2_score(y, preds_all)
final_rmse = np.sqrt(mean_squared_error(y, preds_all))
print(f"\n=== Final Model (Trained on entire top-5 dataset) ===")
print(f"R² on training set: {final_r2:.4f}")
print(f"RMSE on training set: {final_rmse:.4f}")
# Save the final model
joblib.dump(final_model, "model_general_top5_final.pkl")
print("Saved: model_general_top5_final.pkl")
# Feature Importances
feat_importances = pd.DataFrame({
'feature': X.columns,
'importance': final_model.feature_importances_
}).sort_values('importance', ascending=False).reset_index(drop=True)
print("\nTop 10 Features by Importance:")
print(feat_importances.head(10).to_string(index=False))
Top 5 genres: ['black-metal', 'gospel', 'ambient', 'acoustic', 'alt-rock']
Filtered to top-5 genres: 106,862 rows.
Encoding & scaling complete. Columns now: ['popularity', 'year', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature', 'genre_acoustic', 'genre_alt-rock', 'genre_ambient', 'genre_black-metal', 'genre_gospel', 'mood_Angry/Tense', 'mood_Happy/Energetic', 'mood_Peaceful/Relaxed', 'mood_Sad/Depressed']
Feature matrix size: (106862, 23)
=== K-Fold CV Results (Top-5 Genres) ===
Average R² (5 folds): 0.5957
Average RMSE (5 folds): 9.3376
=== Final Model (Trained on entire top-5 dataset) ===
R² on training set: 0.9438
RMSE on training set: 3.4814
Saved: model_general_top5_final.pkl
Top 10 Features by Importance:
feature importance
genre_alt-rock 0.342569
year 0.161087
genre_black-metal 0.048945
duration_ms 0.048236
loudness 0.040977
speechiness 0.040854
tempo 0.040711
danceability 0.040605
acousticness 0.039572
valence 0.038940
InĀ [24]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
# 1. Assume we already have a DataFrame `df` with columns including "popularity".
df = df_with_moods.copy()
# A. Minimal data prep for demonstration (if you haven't already encoded, etc.)
# Let's just do a quick example with the numeric columns directly
# In practice, you'd do your usual data prep: drop columns, get_dummies, scale, etc.
feature_cols = ['danceability','energy','loudness','tempo'] # for simplicity
df = df.dropna(subset=['popularity'] + feature_cols) # remove rows missing these
X = df[feature_cols]
y = df['popularity']
# B. Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
################################################################################
# 2. Random Baseline
################################################################################
# - We'll create predictions for y_test by randomly sampling from y_train's min-max range.
# Alternatively, you could sample from the entire dataset's min-max.
y_min = y_train.min()
y_max = y_train.max()
# Generate random predictions for each test sample
random_preds = np.random.uniform(low=y_min, high=y_max, size=len(y_test))
# Evaluate random baseline
rand_rmse = np.sqrt(mean_squared_error(y_test, random_preds))
rand_r2 = r2_score(y_test, random_preds)
print(f"=== Random Baseline ===")
print(f"RMSE: {rand_rmse:.4f}")
print(f"R²: {rand_r2:.4f}\n")
################################################################################
# 3. Random Forest Model
################################################################################
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)
rf_preds = model.predict(X_test)
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_preds))
rf_r2 = r2_score(y_test, rf_preds)
print(f"=== Random Forest ===")
print(f"RMSE: {rf_rmse:.4f}")
print(f"R²: {rf_r2:.4f}")
=== Random Baseline === RMSE: 45.6487 R²: -7.2732 === Random Forest === RMSE: 15.5293 R²: 0.0425
InĀ [25]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import joblib # For saving/loading models
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
# 1. Assume df_with_moods is already loaded in your environment
df = df_with_moods.copy()
# 2. Drop 'artist_popularity' as requested
if 'artist_popularity' in df.columns:
df.drop(columns=['artist_popularity'], inplace=True, errors='ignore')
# 2. Drop 'year' as requested
if 'year' in df.columns:
df.drop(columns=['year'], inplace=True, errors='ignore')
# 3. Identify all moods
moods = df['mood_label'].unique()
# 4. Find the smallest mood subset size
mood_subset_sizes = []
for mood in moods:
size = len(df[df['mood_label'] == mood])
mood_subset_sizes.append(size)
min_size = min(mood_subset_sizes)
# 5. Create the sampled data for the General Model (5% sample)
sample_fraction = 0.05
df_general_sample = df.sample(frac=sample_fraction, random_state=42)
print(f"Total dataset size: {len(df)}")
print(f"General model sample size (5%): {len(df_general_sample)}")
# 6. Create separate subsets for each mood with equal size = min_size
mood_dataframes = {}
for mood in moods:
mood_df = df[df['mood_label'] == mood]
# Sample 'min_size' rows from this mood subset
mood_sampled = mood_df.sample(n=min_size, random_state=42)
mood_dataframes[mood] = mood_sampled.copy()
# 7. Concatenate these mood dataframes if you need the *combined* mood dataset
# (Typically you won't combine them for separate training, but let's keep an option)
df_mood_equal_sized = pd.concat(mood_dataframes.values(), ignore_index=True)
# Encode categorical features
# For 'genre', drop the first category
df_general_encoded = pd.get_dummies(df_general_sample, columns=['genre'], drop_first=True)
# For 'mood_label', do NOT drop the first category
df_general_encoded = pd.get_dummies(df_general_encoded, columns=['mood_label'], drop_first=False, prefix='mood')
# Scale the relevant numerical columns (excluding 'artist_popularity', which we dropped)
numerical_columns = ['danceability', 'energy', 'key', 'loudness',
'mode', 'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']
scaler = StandardScaler()
df_general_encoded[numerical_columns] = scaler.fit_transform(df_general_encoded[numerical_columns])
# We also need to encode + scale for each mood subset
mood_dfs_encoded = {}
for mood, mood_df in mood_dataframes.items():
temp_df = pd.get_dummies(mood_df, columns=['genre'], drop_first=True)
temp_df = pd.get_dummies(temp_df, columns=['mood_label'], drop_first=False, prefix='mood')
temp_df[numerical_columns] = scaler.transform(temp_df[numerical_columns])
mood_dfs_encoded[mood] = temp_df.copy()
print("Data Preparation complete.")
Total dataset size: 1159748 General model sample size (5%): 57987 Data Preparation complete.
Model Training¶
InĀ [26]:
# 1. Identify features and target
X_general = df_general_encoded.drop(columns=['popularity', 'track_id', 'track_name', 'artist_name'], errors='ignore')
y_general = df_general_encoded['popularity']
# 2. Train/test split
X_train_gen, X_test_gen, y_train_gen, y_test_gen = train_test_split(X_general, y_general,
test_size=0.2, random_state=42)
# 3. Train a Random Forest
general_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
general_model.fit(X_train_gen, y_train_gen)
# 4. Evaluate
gen_preds = general_model.predict(X_test_gen)
gen_rmse = np.sqrt(mean_squared_error(y_test_gen, gen_preds))
gen_r2 = r2_score(y_test_gen, gen_preds)
print(f"General Model RMSE: {gen_rmse:.4f}")
print(f"General Model R²: {gen_r2:.4f}")
# 5. Save the model
joblib.dump(general_model, 'model_general.pkl')
print("General model saved as 'model_general.pkl'.")
General Model RMSE: 11.6495 General Model R²: 0.4551 General model saved as 'model_general.pkl'.
Mood Specific Model Training¶
InĀ [27]:
import re
mood_models = {} # Dictionary to store mood-specific models
mood_metrics = [] # List to store metrics for each mood
for mood, mdf in mood_dfs_encoded.items():
# 1. Identify X, y
X_mood = mdf.drop(columns=['popularity', 'track_id', 'track_name', 'artist_name'], errors='ignore')
y_mood = mdf['popularity']
# 2. Split
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(
X_mood, y_mood, test_size=0.2, random_state=42
)
# 3. Train
mood_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
mood_model.fit(X_train_m, y_train_m)
# 4. Predict + Evaluate
mood_preds = mood_model.predict(X_test_m)
mood_rmse = np.sqrt(mean_squared_error(y_test_m, mood_preds))
mood_r2 = r2_score(y_test_m, mood_preds)
# 5. Save model
# Replace any characters that might break the file path (e.g. '/', ':', etc.) with '_'
safe_mood = re.sub(r'[^\w\-]+', '_', mood)
model_filename = f"model_mood_{safe_mood}.pkl"
joblib.dump(mood_model, model_filename)
print(f"Mood '{mood}' Model -> RMSE: {mood_rmse:.4f}, R²: {mood_r2:.4f}")
print(f"Saved as '{model_filename}'.\n")
# Store in dictionary and metrics
mood_models[mood] = mood_model
mood_metrics.append({
'mood': mood,
'rmse': mood_rmse,
'r2': mood_r2
})
Mood 'Sad/Depressed' Model -> RMSE: 11.3303, R²: 0.4737 Saved as 'model_mood_Sad_Depressed.pkl'. Mood 'Peaceful/Relaxed' Model -> RMSE: 10.6366, R²: 0.5202 Saved as 'model_mood_Peaceful_Relaxed.pkl'. Mood 'Angry/Tense' Model -> RMSE: 10.9768, R²: 0.5154 Saved as 'model_mood_Angry_Tense.pkl'. Mood 'Happy/Energetic' Model -> RMSE: 11.8286, R²: 0.4836 Saved as 'model_mood_Happy_Energetic.pkl'.
Performance Comparison & Visualizations¶
InĀ [28]:
# 1. Collect results for easy plotting
all_moods = [m['mood'] for m in mood_metrics]
rmse_vals = [m['rmse'] for m in mood_metrics]
r2_vals = [m['r2'] for m in mood_metrics]
# 2. Add the General model as a reference
general_rmse_val = gen_rmse
general_r2_val = gen_r2
# 3. Calculate average R² across all moods
avg_r2_val = sum(r2_vals) / len(r2_vals)
# Print them out
print(f"General Model R²: {general_r2_val:.4f}")
print(f"Average Mood-based R²: {avg_r2_val:.4f}")
# 4. Plot RMSE comparison
plt.figure()
plt.bar(all_moods, rmse_vals)
plt.axhline(y=general_rmse_val, linestyle='--', label='General Model')
plt.title("Mood-Specific RMSE Comparison (Lower = Better)")
plt.xlabel("Mood")
plt.ylabel("RMSE")
plt.legend()
plt.show()
# 5. Plot R² comparison
plt.figure()
plt.bar(all_moods, r2_vals)
plt.axhline(y=general_r2_val, linestyle='--', label='General Model')
plt.axhline(y=avg_r2_val, linestyle=':', label='Average Mood Models')
plt.title("Mood-Specific R² Comparison (Higher = Better)")
plt.xlabel("Mood")
plt.ylabel("R²")
plt.legend()
plt.show()
General Model R²: 0.4551 Average Mood-based R²: 0.4982
InĀ [29]:
importances_dict = {} # Will hold DataFrame of feature importances for each mood
# 1. Feature importances for General model
feature_names = X_general.columns # Same columns used for training the general model
gen_importances = general_model.feature_importances_
df_gen_importances = pd.DataFrame({
'feature': feature_names,
'importance': gen_importances
}).sort_values('importance', ascending=False).reset_index(drop=True)
importances_dict['General'] = df_gen_importances
# 2. Feature importances for each mood
for mood, model in mood_models.items():
X_mood = mood_dfs_encoded[mood].drop(columns=['popularity','track_id','track_name','artist_name'], errors='ignore')
mood_importances = model.feature_importances_
df_mood_importances = pd.DataFrame({
'feature': X_mood.columns,
'importance': mood_importances
}).sort_values('importance', ascending=False).reset_index(drop=True)
importances_dict[mood] = df_mood_importances
# 3. Create individual bar charts for each modelās top features
for key, df_imp in importances_dict.items():
plt.figure()
# Show top 10 for readability
top_n = 10
df_top = df_imp.head(top_n)
plt.barh(df_top['feature'][::-1], df_top['importance'][::-1])
plt.title(f"Top {top_n} Features for {key} Model")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()
# 4. Create a heatmap comparing feature rankings
# First, build a DataFrame where rows = features, columns = [General, mood1, mood2, ...],
# and values = rank of that feature in that model.
all_features = set()
for df_imp in importances_dict.values():
all_features.update(df_imp['feature'].tolist())
all_features = list(all_features) # convert to list to iterate
ranking_dict = {}
for feat in all_features:
ranking_dict[feat] = {}
for model_name, df_imp in importances_dict.items():
# Find the rank of this feature in df_imp
found = df_imp[df_imp['feature'] == feat]
if found.empty:
# Feature not present in this dataset's columns
ranking_dict[feat][model_name] = np.nan
else:
# rank is the index + 1 in the sorted DataFrame
rank = found.index[0] + 1 # 1-based ranking
ranking_dict[feat][model_name] = rank
df_rankings = pd.DataFrame(ranking_dict).T # transpose so that rows=features, columns=models
# Because rank "1" means "most important," smaller is better, but let's just plot them as is
plt.figure(figsize=(10, 12))
sns.heatmap(df_rankings, annot=True, fmt='.0f', cmap='YlGnBu')
plt.title("Feature Ranking Heatmap (1 = Most Important)")
plt.xlabel("Model")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()
InĀ [30]:
import pandas as pd
importances_dict = {} # Will hold DataFrames of feature importances for each model
###############################################
# 1. Feature importances for the General model
###############################################
feature_names = X_general.columns
gen_importances = general_model.feature_importances_
df_gen_importances = pd.DataFrame({
'feature': feature_names,
'importance': gen_importances
}).sort_values('importance', ascending=False).reset_index(drop=True)
importances_dict['General'] = df_gen_importances
######################################
# 2. Feature importances for each mood
######################################
for mood, model in mood_models.items():
X_mood = mood_dfs_encoded[mood].drop(
columns=['popularity','track_id','track_name','artist_name'],
errors='ignore'
)
mood_importances = model.feature_importances_
df_mood_importances = pd.DataFrame({
'feature': X_mood.columns,
'importance': mood_importances
}).sort_values('importance', ascending=False).reset_index(drop=True)
importances_dict[mood] = df_mood_importances
##########################################################################
# 3. Build a single table for all models' top 10 features (sorted by rank)
##########################################################################
table_rows = []
top_n = 10
for model_name, df_imp in importances_dict.items():
# Take the top 10
df_top = df_imp.head(top_n).reset_index(drop=True)
# Loop through each row in df_top
for idx in range(len(df_top)):
rank = idx + 1 # rank is index+1
feature = df_top.loc[idx, 'feature']
importance = df_top.loc[idx, 'importance']
table_rows.append((model_name, rank, feature, importance))
# Create a DataFrame with columns: Model, Rank, Feature, Importance
df_all_top10 = pd.DataFrame(table_rows, columns=["Model", "Rank", "Feature", "Importance"])
# If you want to see them grouped by model in ascending order of Rank:
df_all_top10.sort_values(by=["Model", "Rank"], inplace=True)
##########################################################################
# 4. Print the combined table (without DataFrame indices)
##########################################################################
print(df_all_top10.to_string(index=False))
Model Rank Feature Importance
Angry/Tense 1 instrumentalness 0.106996
Angry/Tense 2 duration_ms 0.082904
Angry/Tense 3 danceability 0.061281
Angry/Tense 4 loudness 0.054319
Angry/Tense 5 tempo 0.051187
Angry/Tense 6 speechiness 0.048335
Angry/Tense 7 acousticness 0.047567
Angry/Tense 8 valence 0.044285
Angry/Tense 9 liveness 0.043389
Angry/Tense 10 energy 0.041245
General 1 duration_ms 0.080018
General 2 loudness 0.058548
General 3 valence 0.056051
General 4 danceability 0.055783
General 5 tempo 0.052568
General 6 acousticness 0.050488
General 7 speechiness 0.049787
General 8 liveness 0.047686
General 9 energy 0.045311
General 10 instrumentalness 0.043273
Happy/Energetic 1 duration_ms 0.085945
Happy/Energetic 2 loudness 0.062887
Happy/Energetic 3 danceability 0.056359
Happy/Energetic 4 tempo 0.054205
Happy/Energetic 5 genre_hip-hop 0.053446
Happy/Energetic 6 valence 0.052908
Happy/Energetic 7 speechiness 0.052464
Happy/Energetic 8 acousticness 0.050262
Happy/Energetic 9 energy 0.049160
Happy/Energetic 10 genre_dance 0.048390
Peaceful/Relaxed 1 loudness 0.076338
Peaceful/Relaxed 2 genre_tango 0.072562
Peaceful/Relaxed 3 duration_ms 0.060630
Peaceful/Relaxed 4 danceability 0.051563
Peaceful/Relaxed 5 acousticness 0.049596
Peaceful/Relaxed 6 speechiness 0.046695
Peaceful/Relaxed 7 tempo 0.046318
Peaceful/Relaxed 8 valence 0.045229
Peaceful/Relaxed 9 liveness 0.042763
Peaceful/Relaxed 10 energy 0.041825
Sad/Depressed 1 duration_ms 0.077736
Sad/Depressed 2 loudness 0.070777
Sad/Depressed 3 danceability 0.055890
Sad/Depressed 4 speechiness 0.055878
Sad/Depressed 5 acousticness 0.052472
Sad/Depressed 6 instrumentalness 0.052097
Sad/Depressed 7 energy 0.050863
Sad/Depressed 8 valence 0.050830
Sad/Depressed 9 tempo 0.049620
Sad/Depressed 10 liveness 0.049401
Let's try to create genre specific model¶
InĀ [31]:
df_with_moods.head()
Out[31]:
| popularity | year | genre | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature | mood_label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68 | 2012 | acoustic | 0.483 | 0.303 | 4 | -10.058 | 1 | 0.0429 | 0.6940 | 0.000000 | 0.1150 | 0.139 | 133.406 | 240166 | 3 | Sad/Depressed |
| 1 | 50 | 2012 | acoustic | 0.572 | 0.454 | 3 | -10.286 | 1 | 0.0258 | 0.4770 | 0.000014 | 0.0974 | 0.515 | 140.182 | 216387 | 4 | Peaceful/Relaxed |
| 2 | 57 | 2012 | acoustic | 0.409 | 0.234 | 3 | -13.711 | 1 | 0.0323 | 0.3380 | 0.000050 | 0.0895 | 0.145 | 139.832 | 158960 | 4 | Sad/Depressed |
| 3 | 58 | 2012 | acoustic | 0.392 | 0.251 | 10 | -9.845 | 1 | 0.0363 | 0.8070 | 0.000000 | 0.0797 | 0.508 | 204.961 | 304293 | 4 | Peaceful/Relaxed |
| 4 | 54 | 2012 | acoustic | 0.430 | 0.791 | 6 | -5.419 | 0 | 0.0302 | 0.0726 | 0.019300 | 0.1100 | 0.217 | 171.864 | 244320 | 4 | Angry/Tense |
Let's try Genre Specific Model!¶
InĀ [32]:
# Count occurrences of each genre
genre_counts = df_with_moods['genre'].value_counts()
# Get the top 5 genres (most frequent)
top_5_genres = genre_counts.head(5).index.tolist()
# Print the top 5 genres and their counts
print("Top 5 Genres (by frequency):")
for i, genre in enumerate(top_5_genres, start=1):
print(f"{i}. {genre} -> {genre_counts[genre]} rows")
Top 5 Genres (by frequency): 1. black-metal -> 21837 rows 2. gospel -> 21621 rows 3. ambient -> 21389 rows 4. acoustic -> 21097 rows 5. alt-rock -> 20918 rows
Data Prep: Scaling & Filtering¶
InĀ [33]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import joblib # for saving models
# Assume df_with_moods is already in the environment.
# We'll filter it to the top 5 genres you identified: alt-rock, country, dance, folk, hip-hop.
top_5_genres = ['alt-rock', 'country', 'dance', 'folk', 'hip-hop']
# 1. Filter rows
df_top5 = df_with_moods[df_with_moods['genre'].isin(top_5_genres)].copy()
# 2. (Optional) drop columns you do NOT want
# e.g., 'year', 'artist_popularity', etc., if they exist.
# Let's assume we're removing them for simplicity, as you've done before:
cols_to_drop = ['artist_popularity', 'year'] # Adjust if needed
for col in cols_to_drop:
if col in df_top5.columns:
df_top5.drop(columns=[col], inplace=True, errors='ignore')
# 3. Encode 'genre' and 'mood_label' as dummies
# - drop_first=True for 'genre' to avoid dummy trap, or keep them all if you prefer
df_top5 = pd.get_dummies(df_top5, columns=['genre'], drop_first=True)
df_top5 = pd.get_dummies(df_top5, columns=['mood_label'], drop_first=False, prefix='mood')
# 4. Scale numerical features
numerical_columns = [
'danceability', 'energy', 'key', 'loudness', 'mode',
'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature'
]
scaler = StandardScaler()
for col in numerical_columns:
if col in df_top5.columns:
df_top5[col] = scaler.fit_transform(df_top5[[col]])
print("Filtered to top-5 genres and prepared the data (encoded + scaled).")
print(f"New dataset size: {len(df_top5)} rows.")
Filtered to top-5 genres and prepared the data (encoded + scaled). New dataset size: 87886 rows.
Model Training¶
InĀ [34]:
import pandas as pd
import numpy as np
import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
# =============================================================================
# 1. Filter to Top-5 Genres
# =============================================================================
top_5_genres = ['alt-rock', 'country', 'dance', 'folk', 'hip-hop']
df_top5 = df_with_moods[df_with_moods['genre'].isin(top_5_genres)].copy()
print(f"Filtered dataset size (top-5 genres only): {len(df_top5)} rows.")
# =============================================================================
# 2. Drop Unwanted Columns (if present)
# =============================================================================
cols_to_drop = ['artist_popularity', 'year'] # Adjust if needed
for col in cols_to_drop:
if col in df_top5.columns:
df_top5.drop(columns=[col], inplace=True, errors='ignore')
# =============================================================================
# 3. Encode Genre & Mood Columns Once
# - We set drop_first=False so that *each* of the top 5 genres
# has its own dummy column in the final DataFrame.
# =============================================================================
df_top5 = pd.get_dummies(df_top5, columns=['genre'], drop_first=False)
df_top5 = pd.get_dummies(df_top5, columns=['mood_label'], prefix='mood', drop_first=False)
print("Dummy encoding complete. Current columns:")
print(df_top5.columns)
# =============================================================================
# 4. Scale Numeric Columns Once
# =============================================================================
numerical_columns = [
'danceability', 'energy', 'key', 'loudness', 'mode',
'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature'
]
scaler = StandardScaler()
for col in numerical_columns:
if col in df_top5.columns:
df_top5[col] = scaler.fit_transform(df_top5[[col]])
print("Scaling complete. Now training models...")
# =============================================================================
# 5. General Model (Top-5 Genres Combined)
# =============================================================================
# Define features & target
X_general = df_top5.drop(columns=['popularity', 'track_id', 'track_name', 'artist_name'], errors='ignore')
y_general = df_top5['popularity']
# Split
X_train_g, X_test_g, y_train_g, y_test_g = train_test_split(X_general, y_general, test_size=0.2, random_state=42)
# Train Random Forest
general_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
general_model.fit(X_train_g, y_train_g)
# Evaluate
gen_preds = general_model.predict(X_test_g)
gen_rmse = np.sqrt(mean_squared_error(y_test_g, gen_preds))
gen_r2 = r2_score(y_test_g, gen_preds)
print(f"General (top-5 genres) Model -> RMSE: {gen_rmse:.4f}, R²: {gen_r2:.4f}")
# Save the model
joblib.dump(general_model, 'model_general_top5_genres.pkl')
print("Saved 'model_general_top5_genres.pkl'.")
# =============================================================================
# 6. Genre-Specific Models
# =============================================================================
genre_models = {}
genre_metrics = []
for genre in top_5_genres:
# The dummy column for this genre will be "genre_{genre}"
dummy_col = f"genre_{genre}"
# Just filter the *already encoded & scaled* df_top5
# This ensures columns match exactly what the model expects.
if dummy_col not in df_top5.columns:
print(f"WARNING: Column '{dummy_col}' not found. Possibly it was the baseline if you had drop_first=True earlier.")
print(f"Skipping '{genre}' since its dummy column was not created.")
continue
df_genre_specific = df_top5[df_top5[dummy_col] == 1].copy()
X_genre = df_genre_specific.drop(columns=['popularity','track_id','track_name','artist_name'], errors='ignore')
y_genre = df_genre_specific['popularity']
if len(X_genre) < 50:
# If the subset is too small, skip it
print(f"Skipping genre '{genre}' -> only {len(X_genre)} rows.")
continue
X_train, X_test, y_train, y_test = train_test_split(X_genre, y_genre, test_size=0.2, random_state=42)
genre_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
genre_model.fit(X_train, y_train)
# Evaluate
preds = genre_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
r2 = r2_score(y_test, preds)
# Save the model
model_filename = f"model_genre_{genre}.pkl"
joblib.dump(genre_model, model_filename)
print(f"Genre '{genre}' Model -> RMSE: {rmse:.4f}, R²: {r2:.4f}")
print(f"Saved '{model_filename}'.\n")
# Store metrics
genre_models[genre] = genre_model
genre_metrics.append({'genre': genre, 'rmse': rmse, 'r2': r2})
# Summarize Genre Results
print("=== Summary of Genre-Specific Models ===")
for m in genre_metrics:
print(f"{m['genre']}: RMSE={m['rmse']:.4f}, R²={m['r2']:.4f}")
Filtered dataset size (top-5 genres only): 87886 rows.
Dummy encoding complete. Current columns:
Index(['popularity', 'danceability', 'energy', 'key', 'loudness', 'mode',
'speechiness', 'acousticness', 'instrumentalness', 'liveness',
'valence', 'tempo', 'duration_ms', 'time_signature', 'genre_alt-rock',
'genre_country', 'genre_dance', 'genre_folk', 'genre_hip-hop',
'mood_Angry/Tense', 'mood_Happy/Energetic', 'mood_Peaceful/Relaxed',
'mood_Sad/Depressed'],
dtype='object')
Scaling complete. Now training models...
General (top-5 genres) Model -> RMSE: 12.1179, R²: 0.2181
Saved 'model_general_top5_genres.pkl'.
Genre 'alt-rock' Model -> RMSE: 10.3148, R²: 0.0075
Saved 'model_genre_alt-rock.pkl'.
Genre 'country' Model -> RMSE: 12.4682, R²: 0.1231
Saved 'model_genre_country.pkl'.
Genre 'dance' Model -> RMSE: 13.3145, R²: 0.1339
Saved 'model_genre_dance.pkl'.
Genre 'folk' Model -> RMSE: 11.2586, R²: 0.0646
Saved 'model_genre_folk.pkl'.
Genre 'hip-hop' Model -> RMSE: 13.3836, R²: 0.0524
Saved 'model_genre_hip-hop.pkl'.
=== Summary of Genre-Specific Models ===
alt-rock: RMSE=10.3148, R²=0.0075
country: RMSE=12.4682, R²=0.1231
dance: RMSE=13.3145, R²=0.1339
folk: RMSE=11.2586, R²=0.0646
hip-hop: RMSE=13.3836, R²=0.0524
InĀ [35]:
# Take a random sample to speed up execution
sample_fraction = 0.05 # Use 5% of the data
df_sample = df_with_moods.sample(frac=sample_fraction, random_state=42)
print(f"Original dataset: {len(df_with_moods)} songs")
print(f"Sampled dataset: {len(df_sample)} songs ({sample_fraction*100}%)")
# Continue with the sampled dataset
data = df_sample.copy()
Original dataset: 1159748 songs Sampled dataset: 57987 songs (5.0%)
InĀ [84]:
# First, save mood information before encoding
mood_mapping = {}
if 'mood_label' in df_with_moods.columns:
for idx, mood in enumerate(df_with_moods['mood_label'].unique()):
# Create mapping between original mood and encoded column name
mood_mapping[f'mood_label_{mood}'] = mood
# Process data
data = df_with_moods.copy()
categorical_columns = ['genre', 'mood_label']
numerical_columns = ['year', 'danceability', 'energy', 'key', 'loudness',
'mode', 'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature', 'artist_popularity']
# Encode categorical features
data = pd.get_dummies(data, columns=categorical_columns, drop_first=True)
# Scale numerical features
features_to_scale = ['year', 'tempo', 'duration_ms', 'artist_popularity']
scaler = StandardScaler()
data[features_to_scale] = scaler.fit_transform(data[features_to_scale])
# Now you can create a function to compare across moods using the dummy variables
def compare_general_vs_mood_specific(data, mood_mapping):
"""Compare general model vs. mood-specific models using dummy variables"""
# General model
X = data.drop(columns=['popularity', 'track_id', 'track_name', 'artist_name'], errors='ignore')
y = data['popularity']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
general_model = RandomForestRegressor(n_estimators=100, random_state=42)
general_model.fit(X_train, y_train)
general_pred = general_model.predict(X_test)
general_rmse = np.sqrt(mean_squared_error(y_test, general_pred))
general_r2 = r2_score(y_test, general_pred)
print(f"General model - RMSE: {general_rmse:.4f}, R²: {general_r2:.4f}")
# Mood-specific models
mood_results = []
for mood_col, original_mood in mood_mapping.items():
if mood_col in data.columns:
# Filter data for this mood
mood_data = data[data[mood_col] == 1].copy()
if len(mood_data) > 100: # Ensure enough samples
X_mood = mood_data.drop(columns=['popularity', 'track_id', 'track_name', 'artist_name'], errors='ignore')
y_mood = mood_data['popularity']
X_train_mood, X_test_mood, y_train_mood, y_test_mood = train_test_split(
X_mood, y_mood, test_size=0.2, random_state=42)
mood_model = RandomForestRegressor(n_estimators=100, random_state=42)
mood_model.fit(X_train_mood, y_train_mood)
mood_pred = mood_model.predict(X_test_mood)
mood_rmse = np.sqrt(mean_squared_error(y_test_mood, mood_pred))
mood_r2 = r2_score(y_test_mood, mood_pred)
print(f"{original_mood} model - RMSE: {mood_rmse:.4f}, R²: {mood_r2:.4f}")
mood_results.append({
'mood': original_mood,
'rmse': mood_rmse,
'r2': mood_r2,
'n_samples': len(mood_data)
})
# Visualize results
moods = [result['mood'] for result in mood_results]
rmse_values = [result['rmse'] for result in mood_results]
r2_values = [result['r2'] for result in mood_results]
plt.figure(figsize=(14, 6))
# RMSE comparison
plt.subplot(1, 2, 1)
bars = plt.bar(moods, rmse_values, color='skyblue')
plt.axhline(y=general_rmse, color='red', linestyle='--', label=f'General Model: {general_rmse:.4f}')
plt.title('RMSE by Mood (lower is better)')
plt.xlabel('Mood')
plt.ylabel('RMSE')
plt.xticks(rotation=45, ha='right')
plt.legend()
# R² comparison
plt.subplot(1, 2, 2)
bars = plt.bar(moods, r2_values, color='lightgreen')
plt.axhline(y=general_r2, color='red', linestyle='--', label=f'General Model: {general_r2:.4f}')
plt.title('R² by Mood (higher is better)')
plt.xlabel('Mood')
plt.ylabel('R²')
plt.xticks(rotation=45, ha='right')
plt.legend()
plt.tight_layout()
plt.show()
return {
'general': {'rmse': general_rmse, 'r2': general_r2},
'mood_specific': mood_results
}
# Run the comparison
results = compare_general_vs_mood_specific(data, mood_mapping)
General model - RMSE: 7.4453, R²: 0.8117 Happy/Energetic model - RMSE: 7.9850, R²: 0.7937 Sad/Depressed model - RMSE: 7.5421, R²: 0.7865 Peaceful/Relaxed model - RMSE: 7.7174, R²: 0.7962
InĀ [86]:
import matplotlib.pyplot as plt
import seaborn as sns
# Create a boxplot of popularity by mood
plt.figure(figsize=(12, 6))
sns.boxplot(x='mood_label', y='popularity', data=df_with_moods)
plt.title('Popularity Distribution by Mood Category', fontsize=16)
plt.xlabel('Mood', fontsize=14)
plt.ylabel('Popularity Score', fontsize=14)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
# Add a violin plot for more detailed distribution
plt.figure(figsize=(12, 6))
sns.violinplot(x='mood_label', y='popularity', data=df_with_moods)
plt.title('Popularity Distribution by Mood (Violin Plot)', fontsize=16)
plt.xlabel('Mood', fontsize=14)
plt.ylabel('Popularity Score', fontsize=14)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
# Calculate and display mean popularity by mood
mood_popularity = df_with_moods.groupby('mood_label')['popularity'].agg(['mean', 'median', 'std', 'count'])
mood_popularity = mood_popularity.sort_values('mean', ascending=False)
print("Popularity statistics by mood:")
print(mood_popularity)
# Create a bar chart of mean popularity by mood
plt.figure(figsize=(10, 5))
ax = sns.barplot(x=mood_popularity.index, y=mood_popularity['mean'])
plt.title('Average Popularity by Mood', fontsize=16)
plt.xlabel('Mood', fontsize=14)
plt.ylabel('Average Popularity Score', fontsize=14)
plt.grid(axis='y', alpha=0.3)
# Add the actual values on top of each bar
for i, p in enumerate(ax.patches):
ax.annotate(f"{p.get_height():.2f}",
(p.get_x() + p.get_width() / 2., p.get_height()),
ha = 'center', va = 'bottom')
plt.tight_layout()
plt.show()
Popularity statistics by mood:
mean median std count
mood_label
Happy/Energetic 29.987169 29.0 17.570926 112700
Angry/Tense 28.158760 27.0 17.127361 121687
Sad/Depressed 27.377876 26.0 16.482864 65061
Peaceful/Relaxed 26.368377 24.0 17.045683 20403
InĀ [96]:
# Create a dictionary to store the trained models
mood_models = {}
# The general model training code might look like:
X = data.drop(columns=['popularity', 'track_id', 'track_name', 'artist_name'], errors='ignore')
y = data['popularity']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
general_model = RandomForestRegressor(n_estimators=100, random_state=42)
general_model.fit(X_train, y_train)
# Store the general model
mood_models['General'] = general_model
# For each mood, train a separate model and store it:
for mood_col, original_mood in mood_mapping.items():
if mood_col in data.columns:
# Filter data for this mood
mood_data = data[data[mood_col] == 1].copy()
if len(mood_data) > 100: # Ensure enough samples
X_mood = mood_data.drop(columns=['popularity', 'track_id', 'track_name', 'artist_name'], errors='ignore')
y_mood = mood_data['popularity']
X_train_mood, X_test_mood, y_train_mood, y_test_mood = train_test_split(
X_mood, y_mood, test_size=0.2, random_state=42)
mood_model = RandomForestRegressor(n_estimators=100, random_state=42)
mood_model.fit(X_train_mood, y_train_mood)
# Store the model
mood_models[original_mood] = mood_model
InĀ [97]:
# Get feature importances from the general model
X = data.drop(columns=['popularity', 'track_id', 'track_name', 'artist_name'], errors='ignore')
y = data['popularity']
# Train a model on a small sample for speed
sample_size = min(5000, len(X))
sample_indices = np.random.choice(len(X), sample_size, replace=False)
X_sample = X.iloc[sample_indices]
y_sample = y.iloc[sample_indices]
# Train a quick model
quick_model = RandomForestRegressor(n_estimators=50, random_state=42)
quick_model.fit(X_sample, y_sample)
# Get feature importances
features = X.columns
importances = pd.DataFrame({
'feature': features,
'importance': quick_model.feature_importances_
}).sort_values('importance', ascending=False)
# Plot top 15 features
plt.figure(figsize=(10, 8))
sns.barplot(x='importance', y='feature', data=importances.head(15))
plt.title('Top 15 Features for Popularity Prediction (General Model)', fontsize=14)
plt.xlabel('Importance', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.tight_layout()
plt.show()
InĀ [100]:
# Get feature importances for each mood model
def visualize_mood_feature_importances(mood_models, features):
"""
Visualize feature importances for each mood model
Parameters:
mood_models (dict): Dictionary of trained models by mood
features (list/Index): List of feature names
"""
# Create a figure with subplots - one for each mood
num_moods = len(mood_models)
fig, axes = plt.subplots(2, 2, figsize=(16, 14))
axes = axes.flatten()
# Process each mood model
for i, (mood, model) in enumerate(mood_models.items()):
if i < len(axes): # Make sure we don't exceed available subplots
# Extract feature importances
importances = pd.DataFrame({
'feature': features,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
# Plot top 15 features (or fewer if not available)
top_n = min(15, len(importances))
sns.barplot(x='importance', y='feature', data=importances.head(top_n), ax=axes[i])
axes[i].set_title(f'Top Features for {mood}', fontsize=14)
axes[i].set_xlabel('Importance', fontsize=12)
axes[i].set_ylabel('Feature', fontsize=12)
plt.tight_layout()
plt.show()
return {mood: model.feature_importances_ for mood, model in mood_models.items()}
# Run the visualization with your existing mood_models
visualize_mood_feature_importances(mood_models, features)
Out[100]:
{'General': array([3.00207830e-02, 1.81099172e-02, 1.54402855e-02, 9.23404102e-03,
1.85807792e-02, 1.88851240e-03, 1.86589669e-02, 1.71651083e-02,
1.32300994e-02, 1.83665274e-02, 1.74779082e-02, 1.82861334e-02,
1.87318313e-02, 1.54065995e-03, 7.12242513e-01, 1.27293683e-02,
5.06759665e-04, 1.73622982e-04, 7.86622590e-04, 1.81520992e-04,
8.13726560e-05, 8.06131841e-04, 5.66968354e-04, 3.22672360e-04,
7.80307863e-05, 3.15362473e-03, 1.46389532e-02, 4.51893426e-04,
2.85993190e-04, 7.96062888e-04, 2.71706621e-05, 5.41941647e-04,
3.62086699e-04, 5.38999471e-04, 3.57731900e-04, 1.23755385e-03,
1.30353791e-03, 5.99261552e-04, 4.76577482e-04, 3.15996466e-03,
4.97883483e-04, 3.27910555e-04, 3.29906122e-04, 3.24232405e-04,
1.71965479e-05, 4.23795068e-04, 2.59435025e-04, 2.87534388e-04,
5.57129504e-04, 4.08121783e-04, 1.00604581e-04, 1.96903221e-03,
2.91900459e-04, 1.58323167e-03, 2.16623721e-04, 5.84964534e-04,
2.40862964e-03, 3.41503612e-04, 1.41968631e-04, 2.82387070e-04,
3.72525688e-04, 3.19690310e-04, 4.84719136e-03, 1.20893617e-04,
6.54224091e-04, 2.48940130e-04, 4.39748187e-04, 1.84761538e-04,
1.49539913e-03, 2.23528266e-04, 1.14742795e-05, 2.39736515e-03,
5.17703016e-04, 3.99239679e-04, 2.51462233e-04, 1.82493591e-04,
7.67122825e-06, 3.01617222e-04, 1.60337909e-04, 2.39636350e-04,
3.02494619e-04, 7.88693539e-04, 5.68027828e-04, 4.72399559e-04]),
'Happy/Energetic': array([3.45164366e-02, 1.93443705e-02, 1.78763179e-02, 9.76786080e-03,
2.00329399e-02, 2.03710504e-03, 1.98904969e-02, 1.89607151e-02,
1.21071716e-02, 1.97487824e-02, 1.89405627e-02, 1.95942101e-02,
1.97555589e-02, 1.22418096e-03, 6.92732360e-01, 1.33008973e-02,
1.38506380e-04, 4.94068583e-05, 9.88791220e-04, 1.99506804e-04,
8.67960321e-05, 4.93296887e-04, 4.65959769e-05, 1.82301414e-04,
8.64863935e-05, 3.75995211e-03, 1.96382536e-02, 8.52962586e-04,
6.91606776e-05, 7.08403181e-04, 2.01313383e-05, 8.43928028e-04,
2.21411518e-04, 4.25258475e-04, 4.71249452e-05, 1.09122929e-03,
1.35257019e-03, 3.14784381e-04, 5.80763156e-04, 3.03190284e-03,
6.92954649e-04, 4.01063865e-04, 1.97199450e-04, 2.52617632e-04,
6.00231357e-06, 5.15472219e-04, 7.21282768e-05, 2.79159464e-04,
5.60888800e-04, 2.54223328e-04, 8.12924147e-05, 2.32676312e-03,
3.31865845e-04, 1.48465779e-03, 1.12859218e-04, 3.83418668e-04,
1.08466996e-03, 9.34502787e-05, 6.81835709e-05, 9.79847855e-06,
6.48966132e-04, 2.09689367e-04, 5.38584490e-03, 1.65445757e-04,
3.07385362e-04, 2.03595626e-04, 8.98009807e-04, 2.72801945e-04,
2.16006124e-03, 2.35555378e-04, 4.83167149e-06, 3.29784743e-03,
4.12577597e-04, 2.98491418e-04, 3.96109658e-04, 0.00000000e+00,
6.42686990e-06, 2.79777998e-04, 1.00640645e-04, 1.56125937e-04,
2.91654676e-04, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]),
'Sad/Depressed': array([2.60292032e-02, 1.86200007e-02, 1.72168369e-02, 9.70492927e-03,
1.88722260e-02, 1.87884511e-03, 1.87112741e-02, 1.78425239e-02,
1.50414286e-02, 1.77732845e-02, 1.77481993e-02, 1.88259367e-02,
1.95133203e-02, 2.90029046e-03, 7.29065620e-01, 6.21413240e-03,
1.64673430e-03, 8.13409798e-05, 5.04928507e-04, 6.92930270e-06,
6.72032138e-05, 9.55765258e-04, 1.90036927e-03, 1.79333568e-04,
7.04545947e-05, 3.19155468e-03, 3.72626546e-03, 9.20492826e-05,
1.29665865e-05, 1.19774507e-04, 9.79475824e-06, 1.86781767e-04,
8.05209634e-05, 2.16141073e-04, 1.77619389e-05, 1.41637201e-04,
1.68773457e-03, 4.87431274e-04, 2.71764719e-04, 6.05263041e-03,
2.36972740e-04, 2.34150545e-04, 4.47839713e-04, 2.61222632e-04,
1.97175649e-06, 1.00924291e-04, 1.13736652e-03, 4.75386949e-05,
2.81326509e-05, 2.73430835e-05, 7.15512078e-06, 1.02498429e-03,
3.59316782e-05, 2.32675493e-03, 1.13586907e-04, 1.10044080e-03,
3.58832858e-04, 3.76655906e-06, 1.23770503e-04, 8.71613728e-04,
4.63418492e-05, 7.38953266e-04, 5.98562983e-03, 6.51430746e-05,
1.71093664e-05, 3.18286329e-04, 3.08493575e-05, 8.01767435e-05,
1.40568577e-03, 1.88777846e-04, 1.92648047e-05, 1.78512606e-03,
1.12977302e-03, 7.02248066e-04, 2.46248805e-05, 5.78469622e-04,
1.43826927e-05, 3.70796629e-04, 9.14459280e-05, 1.56785619e-05,
2.35016559e-04, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]),
'Peaceful/Relaxed': array([2.80515010e-02, 1.79673760e-02, 1.57505915e-02, 9.25925263e-03,
2.20652818e-02, 1.71144120e-03, 1.80357734e-02, 1.63830549e-02,
1.19873038e-02, 1.78457590e-02, 1.76448318e-02, 1.84935429e-02,
1.74917832e-02, 2.09431726e-03, 7.19324526e-01, 9.58588208e-03,
3.32513283e-04, 5.53032857e-05, 1.49600824e-03, 6.56969180e-07,
1.27250633e-04, 1.02184583e-03, 5.96503176e-04, 1.80303894e-04,
2.25933126e-04, 5.19539692e-03, 9.93708827e-03, 2.17400491e-04,
1.20336977e-07, 4.13536939e-04, 3.92649702e-05, 3.60478402e-04,
4.73130479e-06, 4.42837969e-04, 1.47352119e-06, 3.81039179e-04,
1.12203908e-03, 5.75869776e-04, 7.79935853e-05, 4.35913770e-03,
8.38239121e-04, 2.84108157e-04, 2.03620309e-04, 1.10234165e-04,
4.62213678e-09, 1.13062259e-04, 4.19369123e-04, 8.23658510e-05,
1.31634245e-04, 5.82636038e-07, 3.64846501e-07, 1.78449698e-03,
9.60297163e-05, 1.53307484e-03, 2.74537446e-05, 1.24843142e-03,
2.00291595e-04, 0.00000000e+00, 4.22463320e-05, 1.49880006e-04,
1.44787470e-04, 2.24203228e-04, 4.28437696e-03, 5.11988603e-05,
1.76426464e-06, 3.79619983e-04, 1.55204933e-04, 3.39920914e-05,
4.69883291e-03, 1.03536941e-03, 5.81856752e-05, 7.59178655e-03,
1.04320373e-03, 8.47088820e-04, 1.60617023e-04, 2.98403822e-05,
5.58830111e-06, 8.68567807e-04, 9.11396410e-05, 2.74969210e-06,
1.69446306e-04, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00])}
InĀ [102]:
# Check which mood columns exist in the data
mood_columns = [col for col in data.columns if 'mood_label' in col]
print("Mood columns in the data:", mood_columns)
# Check how many songs are in each mood category
for mood_col in mood_columns:
count = data[data[mood_col] == 1].shape[0]
print(f"{mood_col}: {count} songs")
Mood columns in the data: ['mood_label_Happy/Energetic', 'mood_label_Peaceful/Relaxed', 'mood_label_Sad/Depressed'] mood_label_Happy/Energetic: 112700 songs mood_label_Peaceful/Relaxed: 20403 songs mood_label_Sad/Depressed: 65061 songs